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1 On Notation 

In this section, conventions for the mathematical notation used in the tutorial and supplementary material 
are discussed. In general, standard mathematical notation is employed, with some concessions to the simplified 
notations commonly used in statistics and machine learning. The general aim of the notation has been the attempt 
to balance mathematical rigor and practical applicability of the presented framework. 

Sets and Mappings 

For analytical discussions sets Sthat comprise elements sharing a set-specific property p are denoted in the 

form 

S = {s\s has the property p} (1.1) 

A function (mapping) / is generally specified in the form 

f:D -> R,x h> f( x ) (1.2) 

where the set D denotes the domain of the function /, the set R denotes the range of the function /, and 
denotes the mapping of the domain element x £ D onto the range element /(x) 6 R. In concordance with 
contemporary mathematical language, the terms 'function' and 'mapping' are used interchangeably in this tutorial. 
It is important to note that this (mathematical) notation distinguishes between the function / proper, and elements 
/(x) of its range. Usually in statements as the above, the specification of the function abbreviation, its domain and 
its range, is followed by a definition of the functional form linking the domain elements to the range elements (for 
example for a quadratic function: /(x) := x 2 ). 

Derivatives and Integrals 

The derivative of a function of a single variable is usually denoted in 'operator' form as 

where 

^fto\x=a (1-4) 

denotes the value of the derivative in x = a. The partial derivative of a function of multiple variables x u i = 1, ... , n 
with respect to variable x t is denoted as 

A /:D ^,^A /(X) (1 . 5) 

Sometimes the abbreviated bar notation for the derivative, fix), is also used. 
Integrals of a function / are usually denoted as 

Ja/(0# (1-6) 

Often, if the domain of integration is left unspecified, or implicitly understood as comprising the entire real line, the 
integral boundaries a, b are omitted. Likewise, if the variable of integration is n-dimensional, the respective n-time 
integration is left implicit, e.g. for a function /: M 3 -> M, the integral on [a 1( b-^ x [a 2 , b 2 ] x [a 3 , b 3 ] c M 3 

& .fa, 2 f( x i' x 2< x s) dx i dx 2 dx 3 (1-7) 
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is usually denoted as 



Sf(x)dx (1.8) 

Probabilistic concepts 

With respect to probabilistic concepts, the applied probabilistic notation prevalent in physics, machine 
learning, and statistics is employed. Specifically, probability spaces comprising an outcome set fl, a cr-algbra JL and 
a probability measure P, are not explicitly defined. Likewise, random variables will usually not be defined as 
measurable mappings from one measure space to another in the form 

X: (a JL) -» (fl', Jl'), a) i ^ X{co) (1.9) 

Rather, the exposition focuses on image measures, or the distributions of random variables, written as P{X) •■= 
Px(X), and generally defined by 

p(X) := P X {X £ A'} •■= P{a) £ n\X(co) £ A'} {A £ JL') (1.10) 

For example, the statement 

p(x = l) = 0.1 (1.11) 

should be read as 'the probability that the random variable x takes on a value of 1 is 0.1'. More generally, arbitrary 
values that a random value may take on are denoted by a superscripted asterisk, e.g. p(x = x*). An explicit 
distinction between probabilistic statements as frequency limits or degrees of beliefs is omitted. In concordance with 
the notations in machine learning textbooks, the explicit differentiation between probability distributions, 
probability density functions, and probability mass functions is omitted and non-capital letters are used to denote 
random variables and probabilities. Given the focus on Gaussian random variables in this tutorial, all probability 
distributions employed will actually be probability density functions, and the learning of the parameters of 
probability distributions should be understood as probability density function parameter learning. Finally, while 
some univariate examples are discussed, the theory is in general developed for random vectors, i.e. vectors of 
random variables of the form x = (x 1( ...,x d ) T where the entries x 1( ...,x d represent random variables, and d EN 
the dimensionality of the vector. This treatment subsumes the random variable case with d = 1. The most 
important aspect of this is to keep in mind that the covariances of vectors are not positive scalars as in the random 
variable case, but symmetric positive-semidefinite matrices. Occasionally, colloquial reference will be made to 
"random variables" in cases where the random vector concept is equally appropriate. 

As a notational example, the following conventions for writing bivariate and marginal probability 
distributions of random vectors x and y are used: 

p(x,y):X xy — > R+, (x = x\y = y*) >-» p(x = x\y = y*) (1.12) 

for a bivariate probability distribution, where X and y denote the ranges of x and y, respectively, and 

p(x):X — > M+, x = x* i-> p(x = x*) = f p(x = x*,y)dy (1-13) 

for the corresponding marginal distribution over x. With respect to conditional probability distributions, denoted 
here as, 

p(x|y):Xx-y -^R + , (x = x\y = y*) ~ p(x = x*|y = y*) = fgg=g (1-14) 

an important notational convention is employed to distinguish between probability distributions parameterized by 
fixed quantities, i.e. 'nonrandom' variables and probability distributions conditioned on other random variables. 
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Specifically, p#(x) is written for a probability distribution over the random variable x, which is parameterized by the 
fixed (nonrandom) variable 6. Likewise, p$(x\6) is written for a probability distribution that is conditioned on the 
random variable 8 and parameterized by the fixed (nonrandom) variable i9. Finally, in the case of multivariate 
probability distributions, conditional distributions over subsets of variables are often referred to as conditional 
marginal distributions. 

Due to the fact that the parameters of a Gaussian distribution are equivalent with its first central moments, 
namely its expectation and covariance, frequent use is made of expectation and covariance operations on random 
vectors. Here, the notations E p ( Z )(/(x)) and (/(x))p( X ) are used interchangeably to denote the expectation 
operation of a function / of the random vector x £ X under the probability distribution p: 

(■).:T(x)xP(x) -» R, (p(x),/(x)) ~ </(x)) pW := J f(x)p(x)dx (1.15) 

and 

E.C-):T(x)xP(x) -» R, (p(x),/(x)) ~ E p00 (/(*)) := J f(x)p(x)dx (1.16) 

where ^F(x) and J>(x) denote unspecified 1 function and probability distribution sets over x and the integration is 
performed over the range X of x. The reason for the use of these interchangeable notations is that (in the opinion of 
the authors) the bracketed notation (/(x)) p( - X ) emphasizes the 'operator' characteristic of an expectation slightly 
more, and is hence used mainly in algebraic manipulations, while the notation E p ( z )(/(x)) stresses the 'single 
value' characteristic of an expectation or conditional expectation slightly more and is hence used predominantly in 
the statement of results rather than calculations. Covariance operations are usually denoted by the symbol £ in the 
form 

<C(v): P(x,y) -» R+,p(x,y) <C P o, y) (x,y) := E p(Xy) ^(x - E pW (x)) (y - E p(y) (y)) ^ (1.17) 

which for a random vectors of dimensionality d EM refers to a symmetric, positive-semidefinite matrix of 
dimensionality dx d. The probability distribution subscripts in the abbreviations (x) p ^ x - ) ,E p ^(x) and £ p ( A ; j y)(x,y) 
are also often omitted, if the respective distributions is clear from the context. 

For random vectors x,y, i.e. vectors of random variables x = (x 1( ...,x d ) andy = (y x , ... ,y d ), d £ M and scalars 
a,b,c,d £ R, some standard rules for manipulations involving expectations, variances and covariances are listed 
below. 

Expectation of an expectation of a RV 

E(E(x)) = E(x) (1.18) 

Expectation of the product of a scalar and an RV 

E(ax) = aE(x) (1.19) 

Expectation of the sum of two RVs 

E(x + y) = E(x) + E(y) (1.20) 

Alternative expression for the variance of an RV 



1 The spaces T and T are left unspecified to eschew a formal discussion of measure theoretic concepts. The same reasoning is 
behind the simplified probability concept notations discussed. 
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V(x) = E(x 2 ) - E(x) 2 



(1.21) 



Alternative expression for the covariance of RVs x, y 



C(x,y) = E(xy) - E(x)E(y) 



(1.21) 



Covariance of an RVx with itself 



<C(x,x) = V(x) 



(1.22) 



Variance of a RV under scalar multiplication and addition 



V(ax + c) = a 2 V(x) 



(1.23) 



Note that (1.18) - (1.23) are readily transferred to the random vector case, if the corresponding rules of vector 
calculus are respected. 

Finally, the letters /i, land a are usually used to denote the expectation and (co)variance of Gaussian 
random variables and should not be confused with the expectation and (co)variance operations introduced above. 
Gaussian distributions are denoted by N{x;\i,T), where x indicates the Gaussian random vector, and \i, I its 
parameters (see Supplement B for details). 

Statistical concepts 

With respect to the notation of statistical concepts, we note that we will mainly be dealing with iterative 
estimation schemes. We will denote the value that an estimated parameter takes on the ith iteration of some 

iterative parameter estimation scheme is denoted by a superscript (i), such as 0®, v4®,S® . The superscript 
notation thus makes the usual 'hat' notation for parameter estimates (8, for example) redundant, which is hence 
omitted. 

Table 1 summarizes the notational conventions used in the tutorial and supplementary material. 
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Notation 


Meaning 


f:D -*R,x f(x) 


/ denotes a mapping of elements x in its domain 
D onto elements /(x) of its range R. 


^-f:D^R,x»^-f{x) 
dx' dx' 


— f denotes the derivative of f, which 

dx' 1 

corresponds to a mapping of x onto ^f(x). 


£-f:D^R,x~-£-fO$ 

dx t ' dXi 


For functions of multiple variables x x , —,x n (e.g., 
D c R n ), -£-f denotes the partial derivative of/ 
with repsect to x t . 




Integral notation a, b indicate the lower and 
upper boundaries and ^ the integration variable. 


pU,yJ:X X "y — » M + 
= x*,y - y") i-» p(x = x*,y = y*) 


■ ■ ■ 1 1 ■ 1 ■ ■ !*■•! ■* ■■* 

Joint probability distribution notation as 
prevalent in applied mathematical texts. In the 
context of this tutorial, p(x,y) usually refers to a 
probability density function. 


p\X). X — > 1K + X — X l-> 

p(x = %*) := J p(x = x*,y)dy 


nil * 1 1 !*!*■ !*■*! ■* 1 II 

Marginal probability distribution as obtained by 
integration ("marginalization") of continuous 
probability distributions. 


p(.x|y): X x y — > 1R + , (jc = x ,y = y J h 

,- *i *a P(x=x",y=y") 
p(x = x*\y = y*) •■= -j— — 

r J J jp(x,y=y , )dx 


r~v r ■ *■* r !*■* 1 1 !*!*■ !*■*! ■* 

Definition of conditional probability distibutions. 
The conditional distribution of x given y = y* 

1 j_ 1 i* 1 ■ ■ ■ • 1 * . 

equals the normalized joint probability 
distribution of x and y = y* . 


(■).:T(x)xP(x) -> R 
(v(x') f CxT) >-> (,f(xY)„r^ ■- f f(x}v(x}dx 


Notation of the expectation operation for 
functions / of random variables x. 


E.(-): fW x -> IR 
(pO),/(X)) >-+ E pW (/(%)) ■■= S f{x)p{x)dx 


Alternative notation of the expectation for 
functions /of random variables x. 


C( v ): 3>0,y) -» M + ,pO,y) >-> 

c P u,y)fey) : = E P (x,y) U * - EpwW ){y- E P ( y )(y) J 1 


Notation of the covariance for random variables 

X. 




Symbols for the parameters of Gaussian 
distributions. 


A*x|y,^x|y > °x|y 


Notation for "conditional parameters." The 

. r • , , ■ . 1 • . 1 ,1 

parameter of interest is subscripted with the 
random variable whose distribution it governs (x) 
and which is conditioned on another random 
variable (y). 




Symbols for the value of parameter estimates on 
the ith iteration of an iterative parameter 
estimation algorithm. 




£ is a positive-definite matrix 


e fc (R) 


The space of k-times continuously differentiable 
functions on E 


121 


Determinant of the matrix I 


x ~ p(x) 


x is distributed according to p(x) 



Table 1 Notational Conventions of the Tutorial and Supplement 
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2 Some Properties of Gaussian and Gamma Distributions 

The analytical properties of multivariate Gaussian distributions are of central importance for the VB 
approach for LGSSMs In this section, we summarize some of these properties in theorem form. Additionally, an 
overview is provided in Table 2 and Figure 1. 

Definition and central moments of the Gaussian 

A Gaussian distribution over a random vector, i.e., a vector of random variables, x = (x 1( ...,x d ) T £ R d is 
defined by the probability density function 

P„,zM ==N(x;/i,I) = (27r)^|Z|4exp(-i(x-M) r 2-Hx-M)) (2.1) 

Here, pL £ R d and the positive definite matrix Z £ U. dxd , Z > are the fixed parameters of the distribution as 
indicated by the subscript and semicolon notation in p M j;(x) and N(x; pi, Z), respectively. If pi and Z are themselves 
random variables, the statement above is written p(x\pi, Z) = N(x\pi, Z) and takes on its parameterized form only 
for concrete values of the random variables pi = p.* and Z = Z*, i.e. p{x\pi = /i*,Z = Z*) = N{x; pi*,!.*). For the case 
of a univariate random variable x £ R , the probability density function is usually expressed as 

P^,o 2 00 : = N(x; a 2 ) = (27ra 2 )"2 exp (- ^ (x - //) 2 ) (2.2) 
with pL £ M and a 2 £ R + . The first central moment or the expectation of normal distribution is given by 

i.e., the parameter [i £ R d of a multivariate normal distribution N(x; /i, 2) is equivalent to the expectation of the 
identity function of x under the probability distribution N(x; pi, 2). The second central moment or the covariance of 
a normal distribution is given by 

C P^w(.x,x) = Ep^frj ^(x - E Pm2W (x)) (x - E Pf(2W (x)) j = I (2.4) 

i.e., the parameter I £ R dxd , Z > of a multivariate normal distribution N(x; pi,I.) is equivalent to the covariance 
of the identity function of x under the probability distribution N(x;pi, Z). These two equivalences allow the 
expectation of the square of x under p^ 2 (x) to be expressed in terms of the expectation and covariance of x under 
p Mj2 (x), an important property, which is exploited on several ocassions in later sections. Specifically, because of 

V(x) = E(x 2 ) - E(x) 2 (2.5) 

one has for normally distributed random vectors x £ R d : 

E P^M(xx T ) = Vp M2(x) (x) + Ep M2W (x)Ep MzW (x) T (2.6) 

= ( xxT ) PflX (x) + ( x )p M ,zW^ x )p M ,zW 
= Z + nn T 
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The Completing-the-square theorem for the Gaussian distribution 

The completing-the-square theorem states that if x E R d ,b E U d ,A E U dxd ,A > 0, then 

exp (- \x T Ax + b T x) = N(x\A- 1 b,A- 1 )yJ\0.nA)- 1 | exp (±b T A~ l b) (2.7) 

The completing-the-square theorem allows identifying the parameters of a Gaussian probability density function of 
x E R d based on a quadratic form x E R d , given by the argument of the exponent on the left hand side of the 
theorem. 

o It is verified by substitution of N(x; A~ 1 b, A -1 ) on the the right-hand side of the identity above as follows: 

N(x;A- 1 b,A- 1 )J\(2nA)- 1 | exp {^b T A~^b) (2.8) 

- Urn exp (-; ( * - A ~ lbrMx ~ A ~ lb) + i» T "-'») 

= exp(-±(x T Ax-x T AA- 1 b-b T A- lT Axb T A- lT AA- 1 b) +\b T A~ 1 b) 
= exp (- \ (x T Ax -x T b- b T A~ x Ax + b T A~ r b) + \ b T A~ x b) 
= exp (--x T Ax + -b T x + -b T x - -b T A~ x b + -b T A^b) 

^ \ 2 2 2 2 2 / 

= exp (—^x T Ax + b T x^ 

□ 

The Linear transformation theorem for the Gaussian distribution 

The "linear transformation theorem" states that if x ~ N(x; ij. x ,T. x ), where [l x E £ R dxd ,I. x > 

e~N(e; n e , I £ ), where s, yc E E U d , I E > 0, C(x, e) = (C(e, xjf = £ U. dxd and A E R dxd is a matrix, then 

Ax + s = :y~jV(y ;j u y ,I y ) (2.9) 

where y E R d , \i y E R d , I y E R dxd , I y and specifically, 

fly = Afi x + fi £ and I y = AZ X A T + S e (2.10) 

The proof that y is indeed a Gaussian random vector capitalizes on the transformation theorem for probability 
density functions and is suppressed here for brevity. 

o If it is acknowledged that y is a Gaussian random vector, the expressions for the parameters of this Gaussian can 
be derived in an instructive manner from the fact that the parameters of a Gaussian correspond to its first two 
central moments and the general rules for manipulating expectancies and covariances as follows: 

Hy := E(y) = E(Ax + s) = E(Ax) + E(e) = AE(x) + E(e) = A\l x + n E (2.11) 

and 

I y =:E((y-E(y))(y-E(y)) r ) (2.12) 



= IE {(Ax + e - E{Ax + e)){Ax + £- E(Ax + s)) T ^j 
= IE {(Ax + £- AE(x) - E(e))(Ax + s- AE{x) - E(») T ) 

= E (^{a(x - E(x)) + (e - E(e))) {a(x - E(x)) + (e - E(»)) T 

= E {{a(x - E(x)) + (e - E(e))) ((x - E(x)) T A T + (e - E(») T )) 
= AE {(x - E(x))(x - E(x)) T ) A T + AE {(x - E(x))(e - E(») T ) 

+E {(e - E(£))(x - E(x)) T )^ T + E {(e - E(e))(e - E(») T ) 

= AC(x, x)A T + AC(x, e) + £(e, x)A T + C(s, e) 
= ^I X ^ T + X • + • A T + S e 
= ^I X ^ T + 1 £ 

where in the penultimate line the independence property (C(x, s) := of x and s was exploited. 

□ 

The Marginalization and conditioning theorem for the Gaussian distribution 

The marginalization and conditioning theorem states that, if z~ N(z; [i, I), z, [i £ R d , I £ R dxd , I > and 

z := Q , x £ R m , y £ M d - m (2.13) 
with corresponding partitions of the parameters and precision matrix A = I -1 £ R dxd , i.e., with 

_ (^ x \ ^ _ fixx ^xy\ anc j ^ _ ^ xy \ ^ 

VI VZ Z vv / \A VX A VV J 



then 



Further, 



where 



o Preliminaries 



p{x) = N(x;n x ,X xx ) (2.15) 



p(x\y) = N(x;n xly ,Z xly ) (2.16) 



Mx|y — "I" ^"xy^yy (y My) anc ' ^x|y — ^xx ^xy^yy^yx (2-17) 



To verify the marginalization and conditioning theorem, we follow (Bishop, 2007). The expressions for the 
conditional mean and conditional covariance can be verified using two intermediate steps. We first obtain some 
helpful relations between the elements of matrices I and A. Using the fact that A = I -1 we have: 
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(s z Ha a ) = / (2 - 18) 

Based on (2.18) one can infer the following identities 



^■xx ^xy^yy^yx ^xx (2-19) 

^•xx ~ ^-xy^yy^yx = ^xx (2.20) 

and 

AyyAy^. Ey X Y, xx (2.21) 

We next consider the joint probability distribution 

p(x, y) = p(z) = N(z; \i, I) (2.22) 

and define 

a := (x - n x ) and b ~ (y - n y ) (2.23) 
Then the quadratic form in the exponent of p(x,y) can be rewritten as: 

-fr-*m-*> = -fr »(t v)0 <"«> 

= — ^{a T A xx a + a T A xy b + b T A yx a + b T A yy b) 

~ ^CL ^^.j^^CL I CI A.^yA.yyA.yyb I Z? ^^.yy^^yy^^yj^CL I Z? A.yyb^ 

= -\a T {A xx - A xy A yy A yx )a -\{b + A yy A yx af A yy {b + A yy A yx a) 
Further, using the relations (2.18) - (2.20), (2.24) can be rewritten as either 

-\(?- M) r E _1 - j") = - \(x - ii x y^{x -H x )-\{y-Hy- ^ yx ^xx(.x - fi x )j A yy {y - fi y - Y. yx Y. xx (x - fi x )j 

(2.25) 



or 



M) r ^ *0 ~ H) = ~ \{x - H x - Z xy -L yy ~(y - fiyfj A xx (x-fi x - ^ xy ^ yy {y - n y )) - \{y - fi y ) T Y- y ^{y - n y ) 

(2.26) 

□ 

o Verification of (2.15) 

To obtain the marginal distribution p(x), the joint distribution p(x,y) has to be integrated over y. To this 
end, we make use of (2.25) 

p(x) = Jp(x,y)dy (2.27) 

= (2n)~2 |sp exp (- \ (x - fi x ) T I. xx -(x - fi x )^ 
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J^-*-«.-*)XCr-*-W*-*)))* 



d 



1 



= (2tt)~2 |zp e -^-^ ^^ x_ ^(27r)^|A yy | 2 
= iV(x; 

This provides verification of the statement (2.15). 

□ 

o Verification of (2.16) 

We now consider the joint probability distribution p(x,y = y*), which, upon normalization, corresponds to 
the conditional distribution p(x|y = y*) via 

pC*-y)-^^ (2-28) 

Rewriting the unnormalized conditional distribution using quadratic form (2.26) and summarizing all terms 
independent of x independent terms in a constant C, one obtains 

p(x\y = y*) = C ■ exp {^-^(x-n x - E %y Z yy -(y* - Hyj) A xx (x~n x - E xy Z yy -(y* - /i y ))^ (2.29) 

Implicitly assuming appropriate normalization we see that conditional distribution has a Gaussian form with 
parameters 

^x\y* := ^xx = ^xx ~ ^xy^yy^yx (2.30) 

and 

l*x\y* := l*x + ^-xy^yyiy* ~ My) (2.31) 
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Figure 1. Visualization of properties of the Gaussian distribution. Figure 2A visualizes the completing-the-square theorem. The 
left-hand panel depicts three different exponential quadratic forms in x with parameters a and b as noted in the legend. For the 
same parameter settings, the right-hand panel depicts the evaluation of the first multiplicative term of the right-hand side of the 
completing-the-square theorem, i.e., the Gaussian form with parameters )x — a _1 b and a 2 = a -1 as dashed curves. The 
continuous curves represent these forms upon multiplication with the normalization factor, rendering the ensuing curves 
equivalent to the exponential quadratic forms in x depicted in the left-hand panel. Figure 2B visualizes the linear transformation 
theorem both on the analytical and realized level. The left-hand panel depicts the functional forms of two Gaussian distributions 
over variables x and e as red and grey continuous curves, respectively. Additionally, a histogram estimate of both density 
functions based on N — 1000 samples is depicted as a bar graph. The right-hand panel depicts the ensuing functional form of 
the distribution of the random variable x + s as well as the result of a histogram estimate of the addition of the samples 
obtained for the left-hand panel. Figure 2C visualizes the Gaussian marginalization and conditioning theorem for a bivariate 
Gaussian distribution over z = (x,y) T . The leftmost panel depicts the functional form of the joint distribution over x and y. The 
marginal distributions resulting from the application of the marginalization equations are depicted in the second panel from the 
left. For values y = 1 and y — 2, the third panel from the left depicts the conditional marginal distributions p(x\y) as dashed 
and dotted lines, respectively; likewise, for values x — 1 and x — 2, the rightmost panel depicts the conditional marginal 
distributions p(y\x) as dashed and dotted lines, respectively. 



12 



Definition and notation of the multivariate Gaussian distribution for x G Mr 

Putto -N(x;/i,T) = (27r)^|Sr5exp(-i(x- i u) T I- 1 (x- jU )) 
Definition and notation of the univariate Gaussian distribution x £ M 

P^,o 2 CO : = N(x; V-° 2 ) = (27ra 2 )"2 exp (- ^ (x - //) 2 ) 
Expectation and covariance equalities and notation for multivariate Gaussian distributions 

C P^w(.x,x) = Ep^frj ^(x - E P(aW w) (x - Ep^ w M) j = E 

General variance property and application to the multivariate Gaussian distribution 

V(x) = E(x 2 ) - E(x) 2 => E Pfix(x) (xx T ) = <C PfizW (x,x) + E PwM (i)E P(aM (x) T = I + nn T 
"Completing-The-Square Theorem" 

The completing-the-square theorem states that if x G M d , b G M d , A G M dxcl , 4 > then 

exp(-^x T ,4x-Fx) = N(x; A~ x b, A' 1 ) J \ (InA)' 1 | exp Qfe^ -1 ^ 
"Linear Transformation Theorem" 

The "linear transformation theorem" states that if x ~ N(x; jU x ,I x )< where G E d ,2 x G 
M dxd ,I x > 0e~ N(E;n e ,X e ), 

where e, pL E G M d , S £ >- 0, <C(x, e) = (£(e, x)f = G R dxd and 4 G M dxd is a matrix then 

Ax + s = :y~ N(y;fj. y ,Z y ) 
where y G E d , /i y G M d , I y G R dxd , I y and specifically 

fiy = A[L X + [L £ and I y = AT. X A T + 2 £ 
"Marginalization and Conditioning Theorem" 

The marginalization and conditioning theorem states that if z~ N(z; fi, I), z,\i G M d ,T. G M clX£l ,I > 
Oand 

z := Q, x G M m ,y G E d_m 
with corresponding partitions of the parameters and variance matrix I G M dxd , i.e., with 

(l i x\ , _ f^xx ^xy\ 
%) andI = U* i yy j 

then 

p(x) = iV(x; and p(y) = iV(y; /i y , I yy ) 

Further, 

p(x|y) = iV(x; j^iy, Z X | y ), where /i x , y = + ^ xy ^yy(y ~ My) and Z X | y = I xx - I xy 2 y ^ %y 



Table 2 Some Properties of the Gaussian distribution 
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The Gamma distribution 

The Gamma density is considered in its "shape and scale" parameterization given as 

where % is referred to as the "shape parameter" and b% is referred to as the "scale parameter". The expectation and 
variance of A under G{A; a x , b{) are expressed in terms of the parameters as 

E C (A;o,6)U) = ah and V G (A;a,6)U) = ab 2 (2.33) 

The expectation of the logarithm of A under G(A; a, b) is given by 

E G(A;a , 6) 0nA) = i/'(a)+ln6 (2.34) 

where xfj denotes the digamma function. The KL divergence between two gamma densities G q ■= G{A; a q ,b q ) and 
G v := (A; aP, b v ) is given by (Penny, 2001) 

3CL(G« \\G P ) = (a q - l)ifj(a q ) - In b« - a q - In r(a") + In r(aP) + aP In b? - (a? - l)(ifj(a q ) + In b«) + ^ 

(2.35) 
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3 Wiener Processes, Stochastic Integration, and Euler Maruyama Discretization 

The aim of this section is to review a minimum set of concepts from stochastic analysis which forms the 
background for the embedding of (discrete-time) LGSSMs in the context of (continuous-time) stochastic differential 
equations as discussed in Sections 2.1 and 5 of the main tutorial. Supplement Section 3 is not meant to provide a 
comprehensive discussion of stochastic analysis, but rather to collect some pointers to aspects of stochastic analysis 
the interested reader might pursue. Additionally, we hope to provide some interpretation for equations (1) -(6) of 
Section 2.1, as well as equations (36) - (37) of the tutorial example in Section 5. 

Wiener processes 

In this section, the intuition and mathematical formulation of Wiener processes is briefly sketched. For a 
comprehensive review of stochastic processes in general and Wiener processes in particular, the reader is referred 
to (Bass, 2011). In the context of the tutorial, Wiener processes form the mathematical model of random 
fluctuations and the latent level (ref. equation 1 of the main tutorial), which, by means of Euler-Maruyama 
approximation, results in the Gaussian state space evolution of the LGSSM (ref. equation 2 of the main tutorial). 
Here, we proceed by introducing the definition of general stochastic processes, the physical intuition of Wiener 
processes, the definition of Wiener processes as stochastic processes, and finally some properties of Wiener 
processes required for the discussion of stochastic integration. 

General Stochastic Processes 

Informally, a general stochastic process can be regarded as a set of random variables (X t ) t€T indexed by an 
uncountable index set Tci + that models continuous time. Formally, a general stochastic process is defined as 
follows: Let (fi, cA, P) be a probability space and T an arbitrary nonempty set. Let X t : (ft, JV) -» (X t , ® t ) be a (c/Z- 
S t )- random variable for every t £ T. Then 

X T :=(Cl,A,P,(X t ) teT ) (3.1) 

denotes a general stochastic process with parameter set T. 

The uncountability of the parameter set T entails some mathematical difficulties for the definition of the 
outcome set ft of a stochastic process, its corresponding cr-field <A, and the existence of the probability measure P. 
These difficulties can be resolved using Kolmogorov's theorem (Bass, 2011), which allows for proofing the existence 
of a stochastic process based on a consistent set of finite marginal distributions. In other words, the definition of the 
Wiener processes based on normally distributed increments given below can be related back to the general 
definition of stochastic processes by means of Kolmogorov's theorem. Below, we often use the simplified notation 
X(t) for a general stochastic process. 

Physical Intuition of Wiener Processes 

It is helpful to consider the physical interpretation of a Wiener process to motivate the formal definition 
below. In its physical interpretation, a realization of one-dimensional Wiener process can be regarded as the 
projection of the displacement vector of a three-dimensional particle in a fluid onto a selected coordinate axis of M 3 
over time. Specifically, X = would denote the particle centered on its starting point and X t the projection of its 
displacement from the starting point at time t onto e.g., the x-axis. Importantly, the particle is considered large with 
respect to the molecules constituting the fluid. In principle, based on a set of initial conditions, the movement of all 
molecules in the fluid is determined by Hamiltonian (i.e., generalized Newtonian) dynamics. However, because the 
number of molecules is conceived as very large, a statistical or probabilistic view is appropriate. The molecules of the 
fluid are imagined to collide with the large particle at a time scale that is short with respect to the time scale that the 
particle is observed on. In this sense, the movement of the large particle in a time interval [t 1( t 2 ] can be imagined as 
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being the result of the summation of a large number of independent fluid-molecule-particle collisions, each evoking 
a minute movement of the particle. Informally, the central limit theorem of probability theorem states that the sum 
(or average) of a high number of independently distributed random variables (here the minute motions of the 
particle evoked by each molecule-particle collision) is normally distributed. Thus, displacement in X, i.e. the 
difference X t2 — X ti can be considered to be normally distributed, and, because the molecule impacts are conceived 
as independent and evoking motion in all spatial directions, have an expectation of zero, i.e. E(X t2 — X t _J = 0. In 
other words, most likely, due to the complementary nature of the summed impacts between t x and t 2 , the particle 
stays were it is. Further, if the characteristics of the fluid do not change over time (e.g. the fluid is not heated up, 
causing stronger impacts of the molecules), the distribution of the spatial distance in a time interval [t 1( t 2 ] should 
be the same as the distribution of the spatial distance in a time interval [t x + h,t 2 + h], with h > 0. Finally, the 
particle motions evoked by the impact of fluid molecules are considered to be independent over disjoint time 
intervals. 

Definition of Wiener Processes 2 

The intuition of a Wiener process realization as a large particle colliding with many small fluid molecules at a 
short time scale with respect to the observation time scale of the movement of the large particle yields the following 
set of formal requirements for a mathematical model (ft, JL, P, (X t ) teT ) of Wiener processes: (ft, JL, P, (,X t ) tET ) is a 
stochastic process with parameter set T = E+, for which the following requirements hold: 

(1) Initial value X = P-almost everywhere (i.e., if there exists N c ft for which X (u) N ) 0, co N £ N, then 
P(N) = 0. In other words, P{X = 0} = P({a) £ a\X ((o) = 1}) = 1). 

(2) Normality For every t > 0,X t is normally distributed with expectation and variance parameter v(t). Here 
u(-) denotes the variance of the normal distribution of X t as a function oft. 

(3) Stationarity and Independence (ft, JL, P, (X t ) t€T ) is a stochastic process with stationary increments, i.e. the 
distribution Px t -x of the increments X t — X s (s, t £ T, s < t) only depends on the difference t — s, but not 
on t or 5 itself. In other words, for h > 0, we have Px t+h -x s+h = ^x t -x s - Note that with X t and X s also the 
increment X t — X s is a random variable. Furter, (ft, <A, P, (X t ) teT ) is a stochastic process with stochastically 
independent increments, i.e. for all / = {t x , ...,t n } £ K(T) (where K(T) ~ {J\J c T,0 < \J\ < oo}) with 
n > 2 and t x < ••• < t n the random variables (increments) X t — X t , —,X tn — X t are stochastically 
independent. 

Note that, formally, the mathematical framework of stochastic processes requires an existence proof of the 
postulated probability measure P on (ft, JV) , which can be achieved by showing the consistency of the set of 
marginal distributions defined above. We eschew this existence proof here and instead note some informal 
properties of Wiener processes required for the concept of stochastic integration. 

Properties of Wiener processes 

We first reformulate the Wiener process without reference to the underlying measure space and, from now 
on, denote the Wiener process by W(t), t £ [0, T] (Hassler, 2007). Informally, W(t) is a process with a starting value 
of 0, and independent, normally distributed, stationary increments. More formally, we can restate (1) - (4) in the 
definition of the Wiener process above as follows 

(1) Initial value The initial value of the Wiener process is zero with probability 1, i.e. P(W(0) = 0) = 1. 

(2) Normality The increments W^) - W(t ), W(t n ) - W(t n _{), where < t < t x < ••• < t n are 
independent for n £ M. 



2 In the following section, |5| denotes the cardinality of a set S, not a determinant. 
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(3) Stationarity and Independence The increments W(i) — W(s) are normally distributed with expectation 
and variance equal to the temporal difference s — t for < s < t, i.e. W(i) — W(s) ~ N(W(t) — 
W(s); 0, t — s). Notably, the variance of the increments is stationary, i.e. it only depends on the difference 
t — s, but not the absolute values of s and t, and the joint distribution of the increments is multivariate 
normal with a diagonal covariance matrix. 

The Wiener process is defined by the properties of its increments. Nevertheless, the Wiener process can, inituitively, 
be conceived as a stochastic function of the form W: [0,T] -* R, t *-* W(i) where (t) is distributed according to a 
normal distribution N(W(t); 0, t). This is readily seen by considering the distribution of the increment W(t) — 
W(0) for t £ [0, T]. According to (3) W(t) - W(0) ~ N(W(t) - W(0); 0, t - 0) and according to (1) W (0) = 
with probability 1. Based on the properties of normal distributions, we can thus infer that E(W(tj) = and 
V(W(t)) = t. 

Gaussian Processes 

Equivalently, a Wiener process W(i) can be conceived as a Gaussian process. A Gaussian process is a 
stochastic process (X t ) tET with a countable index set T •■= {t 1( ...,t n }, t 1 <t 2 < ■■■ < t n , where |T| = n, n £ M, 

T 

which is multivariate normally distributed. In terms of Gaussian processes, the vector W(t) •■= (W^t-^), ...,l/K(t n )) 
is distributed according to N(W(t); 0,2), where the entries of the covariance matrix I £ IR nxn , > I are given by 
= min(t(,t / ) (for a proof, see e.g. (Hassler, 2007)). Note in particular, that the diagonal elements of this 
covariance matrix are increasing and that the off-diagonal elements are non-zero, i.e. modeling stochastic 
dependencies. 

Stochastic integration 

Stochastic integrals, in contradistinction to the integrals known from standard calculus and analysis, are 
random variables, implying that the values they assume display some "randomness". The aim of this section is to 
introduce the Ito integral of a stochastic process as the prerequisite for the framework of stochastic differential 
equations. To help with its introduction, the Ito integral is developed as the endpoint of a "stochastification" of the 
Riemann integral known from standard calculus. The development of this section is largely based on (Hassler, 2007), 
Sections 6 - 11. 

Convergence in the quadratic mean 

In this section, we lay the foundations for the definitions of stochastic integrals as the limits of their 
corresponding sum expressions, in analogy to the well-known quadrature expressions for Riemann integrals in 
deterministic analysis. To this end, we require a convergence notion for random variables and discuss the supporting 
interval partition used in the subsequent sections. 

The convergence form required in the definition of stochastic integrals below is "convergence in the 
quadratic mean". A sequence of random variables (X^n^ is said to converge to a random variable X in the 

2 

quadratic mean, if E((Z n — Z) 2 ) -» for n -> oo , which will be denoted as X n -» X. Without proof, we state the 
following (Cauchy) convergence criterion: A sequence of random variables with E(X%) < oo converges in the 
quadratic mean, if for arbitrary n and m, E((X m — X n ) 2 ) -> for m, n -> oo, or, equivalently, if for arbitrary n and 

m, E(X m X n ) -> c < oo for for m, n -> oo, c £ R. 

The concept of convergence in the quadratic mean can be demonstrated using a simple example: Consider 
the sequence (X n ) neM of random variables defined by X n •■= -2f=i^i f° r n = 1/2, ... where the Y it i = 1, ...,n are 
independent, normally distributed random variables with mean and variance a 2 , i.e. Yi~N(Yi; 0, a 2 ) for 
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i = 1, ...,n. Using the Cauchy convergence criterion above, it can be readily shown that this sequence converges in 
the quadratic mean (assuming E(X%) < oo) as follows 

E(X m X n ) = E fei Y t ■ -Zr-i Y t ) = ^TT^ ^Of ) = ° 2 - 

(3.2) 

for n,m -» oo. in the above, the second equation follows, because the expectations of the products Y t ■ Yj for i 
j are 0, as the Y t , i = 1,2, ... are assumed to be stochastically independent. 

Interval partitions 

In order to be able to define the integral of a function on the interval [0, T] c M, the interval [0, T] is 
partitioned into disjoint subintervals of the form 

Sn([0, ?1) == [t , tj] U [t v t 2 ] U ... U [t n _ x , t n ] (3.3) 

where = t < t x < t 2 < ••• < t n -i < t n = T. It is assumed that this partition becomes more and more fine 
grained as n increases, in other words: max 1 <j< n (t( — t^) -» for n -» oo. A typical example of a partition that 

fulfills this criterion is the equidistant partition of [0, T], given by t t = — (i £ M n ), for which one has 

T 

which obviously fulfills lim n ^oo - = 0. We note that for any deterministic function /: [0, T] -> R, 

Sf =1 /(t £ ) - /(t t _i) = /(t x ) - /(t ) + /(t 2 ) - /(t x ) + - + /(t n ) - /(t n _!) (3.5) 
= /(t n )-/(t ) 

= f(T)-no) 

and thus, for the identity function, we have 

Z?=iti-ti-i = r-o = r (3. 6 ) 

An arbitrary point of support in the interval [t(_ 1( t;] will be denoted by tj*. 

The stochastic Riemann integral 

Informally, stochastic Riemann integrals are the usual Riemann integrals with a stochastic process being part 
of the function integrated over. If the stochastic process in question is a Wiener process, the stochastic Riemann 
integral is normally distributed with an expectation of zero and a closed form equation for its variance can be 
obtained. Formally, the general stochastic Riemann integral over the product of a deterministic function / and a 
general stochastic process Z(t) is defined as the limit of the Riemann sum 

R n = Y?=im)mm-ti-i) (3.7) 

If the limit of this sum convergences in the quadratic mean for n -> co independently of the form of the partition 
S n ([0, T]), this limit is defined as the stochastic Riemann integral: 

f r /(s)Z(s) ds := linw Z? =1 /(t*)*(t*)(t; - t^) (3.8) 
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We next consider stochastic Riemann integrals, for which the stochastic process is a Wiener process. For a 
deterministic function /, the stochastic Riemann integral of the product between / and a Wiener process W(i) 
corresponds to a normal distribution with expectation 

e(J" T Ks)W(s)ds) = (3.9) 

and covariance 

C (Sof(sW(s) ds, fif(s)W(s) ds) = / T /(s)/(t) minO, t) dsdt (3.10) 

(for a proof of the above, see (Hassler, 2007)). 

The Riemann-Stieltjes integral 

Riemann-Stieltjes integrals are solutions for specific stochastic differential equations and are normally 
distributed random variables. The integrand of a Rieman-Stieltjes integral is a deterministic function /, however, this 
is integrated with respect to a Wiener process, which defines the Riemann-Stieltjes sum as follows 

Rs n = £?=i/(tnont £ ) - w^)) (3.ii) 

As for the stochastic Riemann integral, if the limit of this sum for n -* co exists in the quadratic mean sense 
independently of the form of the partition S n ([0, this limit is defined as the Riemann-Stieltjes integral 

/ T /(s) dW{s) == lim n _ E? =1 /(tH0nti) - W(t t _0) (3.12) 

As a result from being the limit of a sum of normally distributed random the Riemann-Stieltjes integral is normally 
distributed with expectation 

E(S*fCs)dW(s)) = (3.13) 

and covariance 

c(^ns)dW(s), S*f(s)dWCs)) = fif 2 (s)ds (3.14) 

(for a proof of the above, see (Hassler, 2007)). To see that for the Riemann-Stieltjes integral of the constant unity 
function 

dW(s) = W(t 2 ) - WCtJ (3.15) 

we consider its relation to the concept of partial integration as known from the calculus of standard Riemann 
integrals. Recall that, for two deterministic integrable functions /: [0, T] -> R and g: [0, T] -> R, the following rule of 
integration holds 

f f{t)g'{t)dt = f(t)g(t)\ T - J T f'WgWt (3.16) 

For the Riemann-Stieltjes integral, one has (changing the upper integral bound from T to t £ [0, T], and the variable 
of integration from t to s, for a proof see (Hassler, 2007)) 

J*f(s)dWts) = /(s)W(s)|S - J V(s)d/(s) (3.17) 

which can be reformulated as 
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S^mdWis) = f(t)W(t) - f*W(s)f'(s)ds (3.18) 

Finally, we consider the special case of the Riemann-Stieltjes integral of the unity function /(t) := 1. Substitution 
into the equation above yields 

/„' dW{s) = W(f) - W(0) - £ W(s) -Ods = W(f) (3.19) 
More generally, on the interval [t 1( t 2 ] <= [0, T], one thus has 

£ dW(s) = WW* - ill W{s) -Qds = W(t 2 ) - Wit,) (3.20) 

The Ito integral 

The Ito integral provides the general basis for stochastic differential equations as discussed below. Here, we 
briefly discuss its definition and some of its properties. The general Ito integral of a stochastic process Z(t) with 
respect to a Wiener process W(i) is defined as limit of the Ito sum 

i n ■■= i,um-i)wcti) - w(t £ _ x )) (3.2D 

where, importantly (and in contradistinction to other stochastic integrals, as e.g., Stratonivich integral) the point 
tj-i, i.e., the left border of the partition interval [ti_ 1( t{\ serves as supporting point. If X(i) is a stochastic process 
with finite variance, and if X(t) is independent of the future of the stochastic process, the Ito sum converges to a 
well-defined limit in the quadratic means sense and is defined as the Ito integral 

f*X(t)dW(t) == lim^If^XCt^XOnti) - W(t t _!)) (3.22) 

The Ito integral is a random variable and its first two central moments can be shown to be given by (for a proof see 
(Hassler, 2007)) 

E(/ T Z(t)dW(t)) = (3.23) 

and 

C (jJ^Ct) dW(t),fix(t) dW(tj) = J T E(Z 2 (t)) dt (3.24) 

In the degenerate case that the stochastic process Z(t) corresponds to the unity function, i.e., that X(t) •■= 1 with 
probability P(Z(t)) = 1, the Ito integral evaluates to the Riemann-Stieltjes integral discussed in the previous 
section, which in turn evaluates to the increment of a Wiener process (as seen above, again changing the upper 
integral bound from T to t 6 [0, T], and the variable of integration from t to s): 

f*X(s) dW(s) = f< 1 dW(s) = Wit) (3.25) 

Stochastic Differential Equations 

Diffusions and Stochastic Differential Equations (SDEs) 

A diffusion is a stochastic process comprising the sum of a stochastic Riemann integral and an Ito integral in 
the form 

X(t) = X(0) + S*[i(X(is),s)ds + J*a(X(s),s)dW(s) (3.26) 
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where Z(0) indicates the starting point of the stochastic process X. Here, the functions fi and a are referred to as a 
drift function and volatility functions, respectively, and are functions of time and the diffusion process itself. The first 
integral term, a stochastic Riemann integral, models a deterministic drift of the process X, while the second integral 
term, an Ito integral, models a random component. Taking the derivative with respect to time on both sides of the 
diffusion equation above yields its differential form 

±X(t) = £(X(0) + J>(*0),s) ds + fiaCXCsls) dW(sj) (3.27) 
which usually denoted as 

dX(t) = n(X(t), t)dt + o{X(t), t)dW(t) (3.28) 

Informally, the equation above states, that the change of the process X within an "infinitesimal" small time 
interval at time point t is given as the sum of (1) the product of a deterministic function [i (depending on the state of 
the process X at time point t and time itself) with the "infinitesimal" length of the time interval dt and (2) the 
product of a deterministic function a (also depending on the state of the process X at time t and time itself) with the 
"infinitesimal" small increment of a Wiener process at time t, which is randomly distributed. The classification of 
stochastic differential equations (SDEs), as well as the study of existence and uniqueness properties of their 
solutions, forms an integral part of stochastic analysis. The interested reader is referred to (Kloeden & Platen, 1999) 
in depth analytical and numerical treatments of SDEs. Here, we briefly introduce the set of narrow-sense linear SDEs 
with the aim of introducing the Langevin equation as provided in equation (4). 

Autonomous Narrow-sense linear SDEs and the Langevin equation 

An SDE of the type 

dX{t) = fi{X(f), t)dt + o(X(t), t)dW(t) (3.29) 

is called a linear SDE, if the functions |i:lx [0, T] -» M and a:Rx [0, T] -> R take the form 

H(X(t), t) := jU!(t);r(t) + n 2 {t) and o(X(t), t) ~ a^Xit) + o 2 (t) (3.30) 

for coefficients fi v \i 2 , a x , o 2 : [0, T] -* DL If all coefficients are constants, the linear SDE is called autonomous. 
Further, if fi 2 (t) = o 2 (t) •= 0, the linear SDE is referred to as homogenous. Finally, the linear SDE is referred to as 
linear in the narrow sense, if cri(t) := 0, or in other words, the noise term is additive. The autonomous narrow-sense 
linear SDE thus has the general form 

dX(t) = Gxi*(t) + n 2 )dt + a 2 dW(t) (3.31) 

Setting \i x •■= a,\i 2 •■= and a 2 •■= a, one obtains the "latent linear diffusion process" of equation (ref. equation 4 of 
the main tutorial), commonly known as the Langevin equation: 

dX(t) = aX{t)dt + adW(t) (a < 0, a > 0, t £ [0, T]) (3.32) 
The general solution of the Langevin equation is given by (for a proof see (Hassler, 2007)) 

X(t) = exp(at)*(0) + exp(at)*(0) f* a exp(s) dW(s) (3.33) 
which specifies a homogenous Gaussian process with expectation and variance 

E(X(f)) = exp(at) E(Z(0)) (3.34) 

and 



21 



V(Z(t)) = exp(at) V(Z(0)) + ^ (1 - exp(at)) (3.35) 

Euler Maruyama discretization 

A commonly employed discretization technique for SDEs is the Euler-Maruyama method (Hassler, 2007; 
Kloeden & Platen, 1999), which defines a recursive scheme for the numerical evaluation of SDEs on a time interval 
[0, T] c R. The Euler-Maruyama method is based at a set of discrete time points t t G [0, T], i = 0,1, ... , n with 

= t < h =l< ••• < t n _! < t n = T 

which represent the discretization of the observation time interval 

[0,T]=uf =1 [^T,iT[ (3.36) 

T 

into n equisized bins of length At := - , and which in the context of the tutorial may be conceived as being provided 
by the sampling rate of the measurement or observation process. The first step towards the Euler-Maruyama 
approximation of (3.32) to rewrite it in its integral form given by 

X{t) = X{0) + Jjf aX(s)ds + odW(s) te[0,T]cl (3.37) 

In (3.37) the first integral term denotes a stochastic Riemann integral and the second term denotes a Riemann- 
Stieltjes integral. Considering this integral form on the subinterval [£;_!, t £ [ c [0, T] (i = 1, ... , n) then yields 

X(t t ) = Z(ti_!) + ff' aX(s)ds + odW(s) (3.38) 

By approximating the value of X(s) for s G [tj_!, tj] by its left endpoint value, that is, by setting X(s) •■= X^t^-J for 
all s G [tj-!, tj], one obtains the Euler-Maruyama approximation of the stochastic differential equation 

X(td = Wi-i) + ^ aX&t-Jds + odW{s) (3.39) 

= X(t i _ 1 ) + aX(t i _ 1 )^ i ds + og dW{s) (3.40) 

L l-1 L l-1 

for i = 1, ... , n. Riemann integration of the first integral term yields 

J^ i d5 = 5|^_ i =tj-tj_i=^ = At (3.41) 
and Rieman-Stieltjes integration of the second integral yields 

dW(s) = W(t t ) - Witt-i) (3.42) 
From the definition of Wiener processes we note that that 

W(tj) - WOt-J = w(£)-W (^) (3.45) 
is normally distributed with expectation parameter and variance parameter At.To rewrite 

dX(i) = aX(t)dt + odW(t) (3.46) 
in the familiar autoregressive process of order 1 (AR(1)) form 

x t = ax t _ x + s t , £ t ~N(e, ; 0, o%) (3.47) 
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we thus start by considering 

X{td = *(t t -i) + erf(t t _!)At + ff(inti) - ^(ti-i)) (3-48) 

and change the notation by setting 

x t . := X(ti), w ti : = W(tt) and s t . := iv t . - w ti _! (3.49) 

resulting in 

x t . = x t ._ x + ax t At + ae t . where £ t .~N(e t .; 0, At) (3.50) 
Next replace the discrete real time points = t , ... , t n = T by integer indices t = 0, 1,2 ... , T ~ n resulting in 

x t = x t _ x + ax t _ t At + as t where e t ~N(e t ; 0, At) (3.51) 
Using ¥(aX) = a 2 ¥(X), we can rewrite the above as 

x t = (1 + aAt)x t _! + £ t where s t ~N(s t ; 0, ct 2 At) (3.52) 

Finally, we define 

a := (1 + aAt) and cr 2 := a 2 At 

and obtain the AR(1) form 

x t = ax t _! + e t where e t ~N(e t ; 0, a 2 ) (3.53) 

To transfer probabilistic statements over LGSSM parameters to probabilistic statements over SDE parameters, we 
use the probability density function transformation theorem in Supplement Section 9. To this end, it is helpful to 
note that the transformation mappings from LGSSM to SDE parameters, and their inverses are given as 

r 1 :l-»R ) OHT 1 (o)=^(o-l) (3.54) 

with 

FT 1 : M -» R,a >-> 7 , f 1 (a) = (1 + aAt) (3.55) 

and, because we will derive probabilistic statements on the inverse A x of a x using A x = — ^ 

T 2 : R + -» R + ,A X » r 2 (4) = J= (3.56) 

with inverse 

Tf 1 : M + -» E + , a ~ TfV) := (^) (3.57) 
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4 An Introduction to Variational Calculus 

The basic problem of variational calculus is the optimization, i.e. minimization or maximization, of 
functionals with respect to their arguments, usually functions on real vector spaces. In other words, given a 
functional T defined on a function space V, the aim is to find an element y of V for which T assumes an extremal 
value. This basic problem can be understood in analogy to the maximization/minimization of a function defined on a 
standard vector space, such as R or R n . In fact, the basic approach of solving the variational optimization problem 
shows many intuitive similarities with approaches from ordinary calculus (Figure 3, panels A and B). In ordinary 
calculus, the argument x that maximizes a given function / is usually found by first identifying the stationary points 
of/, i.e. those arguments x, for which the derivative or gradient of / vanishes. This is the necessary condition for an 
extremal value: If / has a maximum or a minimum in x, then its gradient in x is equal to zero. However, the 
condition of a vanishing gradient is not sufficient for a (local) extremal value: The gradient may also vanish for an 
inflection point or for constant parts of /. Similar considerations apply for the calculus of variations. The formal 
treatment of necessary and sufficient conditions of functionals is a central topic of functional analysis, just as real 
analysis represents a formal treatment of ordinary calculus. In the current tutorial, functional analysis is eschewed 
and only the necessary condition for extremals of functionals will be considered. This approach is somewhat justified 
by the applied character of this tutorial and the intuition that the functionals considered are usually convex (for 
convex functions, the necessary condition for an extremal is also sufficient (Boyd & Vandenberghe, 2004)). The basic 
problem of variational calculus may thus be formalized as follows: Let V be a vector space of functions, for example, 
the space of all real-valued differentiable functions on the real line V •■= {y £ Further, let T be a real- 

valued functional on V, i.e., let T bea mapping that takes an element y of V as input and maps it onto the real line, 
formally T: V -> R. Then, the basic problem of variational calculus may be stated as finding y* £ V for which T 
assumes a minimum (or —T assumes a maximum): 

y* := argmin yel , T(y) = argmax yel ,(-J'(y)) (4.1) 

A necessary condition for such an extremal argument (and, by appealing to the intuition from ordinary 
calculus, an equation of the type /'(x) = which can be solved for the respective argument x) can be established 
by introducing the so-called "Gateaux derivative" of T: Lety* be a minimum o\T and let v £ V be another function 
in V. Because V is a function space, y* + ev for e £ R is also an element of V. Moreover, because y* is a minimum 
of^ 

T{y* + ev) > T(y*) (4.2) 
This consideration allows for defining a function of e as follows: 

h y *y. R -> R, e >-> h y * iV (e) ~ T(y* + ev) (4.3) 

By definition, this function has a minimum in e = as T{y* + • v) = T(y*) < T(y* + ev). Thus, if h y * v is 
differentiable, its derivative with respect to e vanishes in e = 0, in other words, 

Y e V,v( £ )l £ =o = lim e ^o— = (4-4) 

This derivative of h y * v in e = (if defined) is referred to as the Gateaux derivative of T and is usually denoted as 
8T(y*, v). One thus obtains 

ST(y*,v) := \im^ nr+£V ^ nyl = £ h r ^e)\ E=0 = (4.5) 

A necessary condition for an extremal value of the functional T can thus be expressed as follows: Let y* be an 
extremal value of J 1 , i.e., y* •■= argmin yel , T(y). Then it follows that 
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8T(y*,v) = 



(4.6) 



for all v G V given that ST(y*,v) exists. This condition thus provides an approach for determining an extremal 
argument of a functional: after determining 8T(y, v) for a given function T , set 8T{y,v) = and solve for y. This 
approach yields the stationary point(s) of T , which can then be investigated with respect to their extremal 
properties. Further, in the case of a convex function T, the stationary points already constitute the 
minimum/maximum argument(s). 

4.1 The Variational Problem with Fixed Endpoints 

Above, a necessary condition for an extremal argument y* of a functional T was introduced within a very 
general context. Here, this general approach is made concrete for a specific class of functionals (subsuming, in a 
sense, the variational free energy considered in Section 4 of the main tutorial) and is demonstrated by means of an 
example. Specifically, by limiting the function space to real-valued functions on an interval [x , x-J and specifying the 
class of functionals as the integrals of the so-called Lagrange functions on this interval, one arrives at the Euler- 
Lagrange equation for functional optimization. As will be shown below, the Euler-Lagrange equation defines the 
extremal argument y* of a functional T by means of a second-order partial differential equation, which can be 
solved using methods from ordinary calculus. 

The variational fixed endpoint may be stated as follows: let (x ,y ), (xi,yi) be elements of R 2 and V the 
vector space of all twice-differentiable functions y on (xq^) which fulfill the endpoint conditions y(x ) = y and 
yOi) = Vi, i-e., 

V := {y G C 2 [x , x x ] |y(x ) = y , y(x x ) = y x } (4.7) 

Further, let L be a twice partially differentiable, real-valued function defined on the set 1 x 1 x [*0'*iL referred to 
as the Lagrange function: 

L: M X IR X [x , x x ] -> E (4.8) 
Then, the variational fixed endpoint problem consist of finding ay* eV such that the functional J'defined by 

T:V ^ R,y»T(y) := J Xl L(y(x),y'(x),x) dx (4.9) 

becomes extremal, in other words, finding 

y* := arg mm yev Q L{y, y', x) dx (4.10) 

Note that, y and y' refer to both the functions y,y' G V and the real coordinates y(x),y'(x) G M of the Lagrange 
function L for x G E. 

Specifying the Gateaux derivative of T as in (4.5) for functions v G V with v(x ) = v(Xi) = then yields the 
following necessary condition for an extremal value in the current fixed endpoint case 

8T{y*,v) = ±Q L(y* + ev,y*' + ev',x) dx\ E=0 = (4.11) 

It can be shown that the derivative of the integral in (4.11) may reformulated such that the necessary condition 
above may equivalently be expressed as (see below for details) 3 



3 For simplicity of notation, the asterisk denoting the extremal argument is suppressed in the following equations, which also 
emphasizes the notion of y being a variable which is solved for. 
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ST(y,v) = o y y L(y,y',x) - ^L(y,y',x) = (4.12) 

The equation on the right-hand side of the expression above is referred to as the Euler-Lagrange equation and 
represents an (implicit) second-order partial differential equations for y. This can be seen by evaluating the total 
differential of its second term resulting in (see below for details) 

« jjHy.y'.x) = ^i(y,y',x)y' + ^i(y,y',x)y" + JjL-. L (y,y',x) 

This implicit partial differential equation for y may be made explicit by dividing the second partial derivative of the 
Lagrange function with respect to y' and evaluated using standard methods from ordinary calculus. The resulting 
y E V is a stationary point of T and a candidate for an extremal argument. Again, if T is convex, it is also already 
known to be an extremal argument. In the following example, the calculus of partial differential equations is 
eschewed, as the functional considered is an explicit function of y' only. 

o Derivation of equations (4.12) and (4.13) 

In this section, we derive the Euler-Lagrange equation 

^(y,y',x)-^L(y,y',x) = (4.14) 



based on the necessary condition for a minimum of 



given by 



T{y) ■■= J i L(y(x),y'(x),x) dx (4.15) 



ST(y,v) =fj*L(y + ev,y' + £v',x)dx\ E=0 = (4.16) 



To this end, we first note the definition of the total differential for a function /: M 3 -> R and y = (y,y',x) T ,v = 
(v, v' , 0) T , e £ R is given by 

|/(y + £ i0| £=0 = I?=i^/(y) • Vi (4.17) 
Because the Lagrange function L is differentiable by definition, we thus obtain 

^/o L Cy + £V,y' + £v',x)dx\ E=0 = J ( ^(L(y + ev.y' + £v',x))\ E=0 dx (4.18) 

= J^L(y,y',x) ■ v + jpL(y,y',x) ■ v' +jpL(y,y',x) ■ dx 
= f l -^L(y,y',x)v + ^jL{y,y',x)v' dx 

= Sl^Liy.y' .x)v dx + $ l Q -^L(y,y' ,x)v' dx 
The second integral term is next integrated using partial integration. In general we have 

Saf'g = fg\ b a -£fg' (4.19) 

Here, we have 
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a = 0, b = l,g=£ F L, g' = ±jpL, f = v, f' = v' (4.20) 

and thus obtain 

Written in full, the above yields 

Joa^ L Cy(*).y'(*),*>'<& (4.22) 

= ^^Ky{x),y'{x),x)v{x)\ l - J^g^L(y(x),y'(x),x)v(x)dx 



Substitution into the equation above results in 

ST(y,v) = (4.23) 

« Jo (^(yW./W.») - i(yW./ GO.*)) "W ^ + ^KyCO.y'O). 0*0) - ^L(y(o),y'(o),oMo) = o 

As we are considering only those v for which v(0) = v(l) = 0, because the function y + ev is supposed to fulfill 
y(0) + £i?(0) = y(0) = y and y(Z) + ev{l) = y(Z) = y ( , we have 

&F(y,t;) = o J ^^L(y(x),y'(x),x) - £^L(y(x),y'(x),x)) i;(x) dx = (4.24) 

We now make use of the following Lagrange-Lemma, which we state without proof: 
Laarange-Lemma 

Let f be a continuous real-valued function on the interval [0, 1] and let 

Jo/OM*) dx = 

For all continuously differentiable, real-valued functionsv that fulfill v(0) = i/(0) = v(V) = v'(l) = 0. Then it 
follows that fix) = on [0, 1]. 

With this Lemma, we thus obtain the Euler-Lagrange equation 

ST(y,v) = ^L(y(x),y'(x),x) - ±^L(y(x),y'(x),x) = (4.25) 

Finally, we note that the Euler-Lagrange equation constitutes a second order partial differential equation for the 
function y, because writing out the differential of the second term on its left hand side yields 

" ;£^(y(*).y'(*).*) = ^^UyW,y'M,x)y'(x) + a , UyM, y'(x), x)y"(x)) + -g-L(y(»),y'(»),») 



(4.26) 
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Written in full, the Euler-Lagrange equation in its general form thus corresponds to 



(4.27) 



Example: The Shortest Distance Between Two Points in R 2 

As an example, we consider the following problem: Let (a, a), (b, /?) £ R 2 , a < b. The vector space V may be 
defined as 

V == {y £ C 1 [a, b] \y (a) = a, y(6) = /?} (4.28) 
and the aim is to find a y £ V for which the functional 

J 7 : V ^ E, y ^ T(y) := ^ Ji + y'(x) 2 dx (4.29) 

becomes minimal. The specific form of the functional in this example is motivated by the fact that, for parameterized 
curves in IR 2 (i.e., functions y of the type /: R -> R 2 ), the arc length is defined as 

I: V -> RZ ~ /(/) := /^ll/'WIh dx (4.30) 

and the function y considered here represents the "function form" of such a curve, i.e., f(x) = (x, y (x)) for which 
/'(x) evaluates to (l,y'(x)) T . In the special case of the example considered here the Lagrange function is given by 

L: E ^ R, y ' h> L(y ') := ^1 + y' 2 (4.31) 

L is thus not explicitly dependent on y and x, the partial derivatives of the Lagrange function with respect to y and 
x evaluate to zero, and the Euler-Lagrange equation simplifies to 

^') = ^(/)oA_L L(/) = (4.32) 

The right-hand side of the equivalence relation (4.32) states that the total derivative of -^L(y') is zero and the 
function -^L(y') thus constant. One hence obtains an ordinary differential equation for y given by 



^ 7 L(y') = 0^y'(x) = J I ^ (4.33) 

for a given constant c £ R (see below for details). As the derivative of y is thus constant, y must be a linear-affine 
function of the type 

y(x) = sx + r,s,r £ R (4.34) 
Based on the fixed endpoints y(a) = a,y{b~) = /?, the constants s and r may be evaluated, resulting in the solution 

y*(t) == argmin yel , f^l+ y'(x) 2 dx = - a) + a (4.35) 

The application of the variational method hence verifies the intuition that the functional form of the curve 
representing the shortest distance between two points in the plane is a straight line (Figure 2 panels C and D). 
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o Derivation of equation (4.33) 



We first evaluate the partial derivative j^L(y') for L(y') := <Jl + y' 2 : 



jpW = jy a + y ' 2 )> = \ a + y 'T- 2 • y ' = -jjL= (4.36) 



The Euler Lagrange Equation 



— f . y ' ^ = 



(4.37) 



thus implies that 



3 = c (4.38) 



for a real constant c £ E. Reformulating then leads to the insight that the derivative of y w.r.t. x is constant 



,12 _ „2 



c 2 (l + y' 2 ) <^>y' 2 = c 2 + c 2 y' 2 <^> y' 2 - c 2 y' 2 = c 2 <^>y' 2 (l-c 2 ) = c 2 <^> y' = (4.39) 

The Isoperimetric Variational Problem 

In addition to the conditions of fixed endpoints, the extremal function maximizing/minimizing a given 
functional in the context of isoperimetric problems have to fulfill additional integral conditions. In general, the 
isoperimetric variational problem may be stated as follows: let (x ,y ), (xi,yi) be elements of M. 2 and V the vector 
space of all twice-differentiable functions y on (xq^) which fulfill the endpoint conditions y(x ) = y and 
y(xi) = y±. Further, let L and G be twice partially differentiable, real-valued functions defined on the set 
IxRx [*0'*i]- Then, the isoperimetric variational problem comprises finding ay* eV such that the functional T 
defined by 

T:V -» I,yHf(y) := J* 1 L(y(x), y'(x), x) dx (4.40) 

becomes extremal, and the integral side-condition 

I:V -» B,yH/(y) := J Xl G(y(x),y'(x),x) dx = (4.41) 

The formal development of an approach for the solution of the isoperimetric variational problem proving the 
existence of Lagrangian multipliers falls into the realm of functional analysis and is eschewed here. Nevertheless, 
assuming the existence of both an extremal argument y* and suitable Lagrangian multipliers A, the Lagrange Lemma 
introduced in below provides the following recipe for solving an isoperimetric variational problem: (1) consider the 
extended Lagrange function F:lxMx [x^xj -> R, defined by 

F(y(x),y'(x),x) == L(y(x),y'(x),x) + AG(y(x),y'(x),x) (4.42) 

(2) instead of minimizing T , minimize the extended functional 

f •■= L Xl ^(yW,y'(x),x) dx (4.43) 

using the fixed endpoint approach developed in Section 3.2 (usually by means of the corresponding Euler-Lagrange 
equations) and (3) simultaneously solve for the side-condition J* 1 G(y(x),y'(x),x) dx = 0. If this approach is 

Xq 
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possible, the Lagrange Lemma guarantees that the resulting y* fulfills the isoperimetric side-condition. This 
approach is demonstrated using a classic example below. 

It should be noted that the functions maximizing the functional of interest of the VML and VB frameworks 
are usually defined on the entire real line, rather than on a real interval with fixed endpoint conditions, as 
considered here. For simplicity, however, a formal treatment of the ensuing improper integrals is eschewed, and the 
methods developed in the current section are transferred to the problems considered in the next section by the 
intuition that the probability density functions of interest are usually Gaussian for which /(x) -> in the limits of 
x -* co and x -» — oo. 

The Lagrange Lemma for the isoperimetric variational problem 

If the existence of a minimum y and of the corresponding Lagrangian multipliers X t (i = 1, ...,m) and 
fij (j = 1, ...,.p) is postulated, the following Lemma can readily be proven, and provides an ansatz for the solution 
of the isoperimetic variational problem: 

Laqranqe-Lemma for the isoperimetric variational problem 

Let M be an arbitrary set and J: M -> R a function to be minimized on the restriction set 

R:={yeM\g(y) = 0,h(y)<0} 
where g ■= {g lt ... , g m ): M -> R m and h ■= ... , /i p ): M -» MP define a set of side conditions. Further, let 
Ai G M. (i = 1, ...,m) and fXj > (J = 1, . p) be specified such, that 

9eS:={yes\j^ =1 njhj(y) = o} 

is a minimum of 

Then y is a Minimum of J on R. 

The Lagrange Lemma for the isoperimetric variational problem can be verified by considering the following: 

J(S) =/(y) + + (4.44) 

= J(y) + mi*im(y) + l?j =1 njhj<$) 
<J(y) + YT=ih9t(y) + T, p j=1 Hjhj(y) 
<J(y) 

If, like in the isoperimetric problem, the side conditions are given by equalities, the functions hj\ M -> R may be set 
to zero in the Lemma above, the restriction set simplifies to 

R:={yE M\g(y) = 0} (4.45) 

and the set S may be discarded. Thus, if the problem of minimizing the extended functional 

f := Q F (yM,y'(.x),x) dx (4.46) 

where 

F(y(x),y'(x),x) := L(y(x),y'(x),x) + XG(y(x),y'(x),x) (4.47) 
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as specified in equations (4.46) and (4.47) can be solved, its solution provides a minimum of T which fulfills the side 
condition specified by /. 

Example: The Problem of Dido 

The problem of Dido may be stated as an isoperimetric variational problem as follows. Consider the vector 

space 

V := {y £ C 1 [a, b] \y (a) = = y (ft)} (4.48) 
The aim is to find a y £ V, which minimizes the functional 

T: V R, y » T(y) := - /* y(x) dx (4.49) 

subject to the constraint (side-condition) that 

g(y) = 0, where g: V -> R,y ^ g(y) := J^l +y'(x) 2 dx - I (4.50) 

for some constant I £ R. Intuitively, this problem may be stated as maximizing the area between the graph of y and 
the x-axis using a y of a given constant arc length I (Figure 2, panels C and D). Applying the recipe for the solution of 
the isoperimetric variational problem discussed above, one obtains the following extended functional (see below for 
details): 

f ■■= ~ £ yW dx + A £ Vl+y'(x) 2 dx - I = £ (-y(x) + AjT+yW - ^) dx (4.51) 
where the Lagrange function is not explicitly dependent on x and is given by 

1:1x1, (y,y')» L(y,y') = -y + AjTTy 7T -^ (4.52) 

For a Lagrange function that explicitly only depends on y and y' , the Euler-Lagrange equation simplifies to (see 
below for details) 

L{y,y')-^jL(y,y')y' = B (4.53) 

where B £ R is a constant. Evaluation of the Euler-Lagrange equation for the problem of Dido results in the 
following first-order ordinary differential equation for y (see below for details): 

JT+7 T (-y--^-B)+A = (4.54) 

A solution for this differential equation is given by 

y(x) = C + V 'A 2 - (x - D) 2 (4.55) 

where C •■= —B — and D £ R is an additional integration constant (for a derivation of this solution, see (Richter, 
Mathias, o. J.)). Taking the square of the equation above then results in 

(y(x) - C) 2 + (x - D) 2 = A 2 (4.56) 

which is recognized as the functional equation for a circle with the center at (D, C) £ R 2 and radius A, while further 
analytical considerations allow for inferring that D = and C > (Richter, Mathias, o. J.). The solution to the 
problem of Dido is thus a circular arc, the exact form of which depends on the relationship between the endpoints 
a, b and the given arc length I of y (Figure 2, panels C and D). 
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o Derivation of equation (4.53) 

To show that indeed 



L(y,y',x)-^L(y,y',x)y' = B (4.57) 



for a real constant B £ R, we show that the total deriviative of the function 



fiy.y'.x) ^Liy.y'.x^-y'^Liy.y'.x) (4.58) 
vanishes, provided that L is not an explicit function of x, and thus ^-L = d L = 0: 

dx dy'dx 

= ii L + y'r y L + y" jy L ) - {y"jy L + + *' 2 si/ 1 + y'y"^) 

'(— L-— — L) 

y \dy dx dy' ) 

= 

where the last equation follows from equation (4.54). 
o Derivation of equation (4.54) 

We apply 

L(y,y')--^L(y,y')y' = B (4.60) 

to the Lagrange function given in (4.52) 

L(y, y') := -y + AjT+y^ - ^ (4.61) 

and obtain 

- y + A JTTy^-^ a - y '^(- y + AjTTy^- 1 ^) = B (4.62) 



dy' 

^-y+^y w -£-y' 7 Bpf =B 



Multiplying both sides by -J\ + y' 2 yields 



-yVl+y' 2 + A(l + y' 2 ) - ^V 1+ y' 2 - Ay' 2 = B^Jl + y' 2 (4.63) 



<=> -yVl + y' 2 + A + Ay' 2 - +y' 2 - Ay' 2 - Bjl + y' 2 = 

^VT+y r (-y-^-B) + A = o 
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A B 



x* = argmin/OO => ^/OOU*- = := argmin:F(/<X)) => ^(/(x) | /w=/ * w = 




Figure 2. An intuitive notion of the calculus of variations. Figure 2A depicts a standard calculus analogy to the extremal problem 
of variational calculus. To find the extremal points x* from a given real-valued function /, standard calculus usually proceeds by 
finding its stationary points. The stationary points of a function are those arguments of the function for which the derivative ^/ 
of the function vanishes, i.e., for which ^f{x)\ x=x > — holds. Figure 2B depicts a conceptual overview of the variational 
calculus framework: central to variational calculus is the problem of identifying a (real-valued) function / in a function space V, 
which extremizes a given functional T. The functional considered in this review (especially the variational free energy) map 
functions (in the context of the tutorial, probability density functions) onto the real line. Candidate functions are found by 
capitalizing on the necessary condition for an extremal, i.e., a vanishing Gateaux derivative 8T(f(x))\^^ = f^ = and solving 
for f*(x). Figures 2C and 2D sketch the variational problems and the solutions to them considered as examples in Section 3. The 
fixed endpoint problem can be conceived as the problem to find the curve (i.e., a mapping /: R -* R n , here, n — 2) of minimal 
length connecting to given points in R 2 . The application of the variational calculus method allows for identifying the straight line 
connecting both points as the optimal function (curve). The isoperimetric problem (here, with fixed endpoints) represents a 
variational problem with additional integral side-conditions. The problem of Dido lies in finding the function of given arc length 
that maximizes the area between the function and the ordinate. Using the calculus of variations and a Lagrange multiplier 
approach, it can be shown that the solutions are circular arcs whose exact form depends on the predetermined arc length. 
Maximization of the variational free energy in the VB framework may be viewed as an isoperimetric problem. Here, the 
functions maximizing the variational free energy need to fulfill the integral side-condition of being probability density functions 
on the real line, i.e., their integral over the real line should equal 1. 
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Derivation of the VB inference theorem for mean-field approximation 

In the VB framework the aim is to approximate the log marginal likelihood In p(y) by iteratively maximizing 
its lower bound T (<7(i9 s ), q(fl\s)) with respect to the arguments q(i9 s ) and q(fl\s)- For iterations i = 1,2, during 
the first maximization of finding q ,( -' +1 -'(t9 s ) / q^{p\s) is treated as a constant, while during the second maximization 
of finding q^ l+1 \^ s ), q^ l+1 \^\ s ) is treated as a constant. However, as i9 s and i9\ s may be used interchangeably, we 
here concern ourselves only with the case of maximizing T (q(fi s ), <?(i9\s)) with Respect To <7(i9 s ). To obtain an 
expression for q^ +1 \^ s ), we thus consider the functional 

T(q(d s lqW(d\ s )) = ffq(0s)qW(?\s) In ( ^g^j ) ^ V where / ^ M = 1 < 4 - 64 ) 
The extended Lagrange function is given in this case as 

F (q(iU qW(p\s)) = Iq(.$s)q {i) (V\s) In (^g^)) + Iq&s) ~ * (4-65) 

and, as in the previous section, the Gateaux derivative ST (q(x), q'-'K^)) is given by the derivative off with respect 
to <7(i9 s ), as T is not a function of <7'(i9 s ) . One thus obtains with a constant C £ R (see below for details) 

ST (q09 s ), <?«0V)) = I q {i) (V\s) In p(y, 0) dt?\ s - In q(ti s ) + C (4.66) 
o Derivation of equation (4.66) 

8T(q(jd s ),qV({)\ s )) (4.67) 

= a^fe {S«Ws)q®(p\s) lnp(y, i9) dti\ s - J q(V s )q (i KV\s) In V°0V)) dti\ s + Aq(d s ) - a) 
= i^) ^ 9 (i) (^\s) lnp(y,i9) di9\ s - J q(&)<7 (i) 0V) ln <7<&) £*0\s ~ J" <7(&)<7 (i) 0V) ln <70V) <*0\s + Aq(0 s ) ~ A) 

= ^ (<? <A) / 9 C0 (A*) ln PCy- * ) dl9 \s ~ 9 Ws) In q (fis) / 9 C0 OV) **Vs " 9 (*s) / 9 C0 0\s) ^ R (*V) <*V + Aq (0 S ) - A) 
= Jq®(t9 Xs )lnp(y,t9)dt9 Xs - \nq(d s ) - 1 - JV° (t9\ s ) In q ® (t9\ s ) dt9\ s + A 
= J <7 (O 0V) In P(y, i9) di9 Xs - ln q(0 s ) + C 

□ 

Setting the Gateaux derivative (4.63) to zero thus yields 

\nq( i+1 \ti s ) == jY«(iV)lnp(y,0)diV + C ( 4 - 68 ) 

Taking the exponential, expressing the multiplicative constant as proportionality factor then yields "VB inference 
theorem for mean-field approximations" 

q^ +1 )(i9 s ) a exp(i q(d\ s )\np(y,-d)dd\ s ) (4.69) 
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5 The Variational Maximum Likelihood Framework and its Application to LGSSMs 



Currently under extensive revision 
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6 Mathematical details of the univariate Gaussian example 



6.1 Introduction 

To demonstrate the VB inference method for i.i.d. data we consider the Bayesian estimation of the 
expectation and variance parameters of a univariate Gaussian (discussed in detail in (Michael Chappell, Adrian 
Groves and Mark Woolrich, 2008; W. D. Penny, 2000) and visualized in Figure 5 of themain tutorial). We thus 
assume that the data is generated by a univariate Gaussian distribution over an observed variable y with true, but 
unknown, expectation parameter \i and inverse variance parameter A, i.e., 

V {y):=N{y; l i,X- 1 ) (6.1) 

Further, we assume that a set of N i.i.d. realizations y{, ... , y^ of the observed variable y has been obtained. In this 
scenario, a classical maximum likelihood point estimation approach would result in estimates for pi and A' 1 based on 
the sample mean fi and the (biased) sample variance a 2 

fi = ln=iy n ^^ 2 =^Zn=l(yn-M) 2 ( 6 - 2 ) 

respectively. As outlined in the tutorial, based on appropriately chosen prior distributions, the aim of the Bayesian 
paradigm is to obtain posterior probability distributions, which quantify the remaining uncertainty over the true, but 
unknown, unobserved variables given the observed variables and an approximation to the model evidence, i.e. the 
probability of the data given the generative model. The generative model of the current example thus constitutes a 
joint probability distribution over the observed variable y and both unobserved random variables \i and A, A > 0. It 
is thus given by 

p(y,H,X) = p(ji,A)p(y\n,A) = p(n,A) Hn=iN (y n |/^A _1 ) (6.3) 

A possible choice for a prior joint distribution over the unobserved variables is given by the product of a univariate 
Gaussian distribution over fi and a Gamma distribution over A according to 

(y,lx,A) = p(jiMA)p(y\ii,A) = N (ji; m^, s 2 )G(A; a x , b x ) Un=i N (jn A -1 ) (6.4) 

We consider a mean-field approximation to the posterior distribution over the data conditional unobserved 
variables, i.e., we set 

p(H,A\y) * q{n)q(A) (6.5) 

Specifically, we set q(ji) to a normal distribution over/i with variational parameters m q and s£ q , and we set q(A) to 
a Gamma distribution over A with variational parameters a^, b% resulting in 

qG0q(A) = N{x;ml,sl q )G{A;al,bl) (6.6) 

In (6.6), {m^,s 2q ,a%,b%) represents a set of "variational" parameters. For convenience and to keep the notational 
overhead at bay, the values of the variational parameters on the ith iteration of the VB algorithm derived below will 
be denoted by m^\s^ l \a^\ and b^\ dropping the "q" superscript. To summarize, we have introduced a set of 
prior parameters [m^s^, a x , b x ] governing the prior distribution p(pi,X) of the generative model in (6.3) and a set 
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of of variational parameters ^m^\ s^ l \ governing the variational approximation q(pL)q(X) to the 

posterior distribution p(fi,A\y). 

6.2 Application of the variational inference for approximate posteriors 

As discussed in the tutorial the variational inference theorem for mean-field approximations (cf. equation (17)) 
states that the variational distribution over the unobserved variable partition i9 s is given by 

q(fl s ) oc exp(J q(d\ s )lnp(y, i9) dd\ s ) (6.7) 

For the current example, we have the mean field approximation 

qiH.X) = q{n)q{X) (6.8) 

and thus 

qGO = C„ • exp (/ q {X) In p (y, fi, X) dX) (6.9) 

and 

q(X) = C x • exp(J q(ji)\xip(y,n,X)dix) (6.10) 

where and Q denote proportionality constants that render the proportionality statement in (6.7) equations in 
(6.9) and (6.10). In the following, we derive an iterative scheme based on the equations above. For this purpose, it is 
helpful (1.) to denote expectations by the expectation operator 

</O0W) : = Sp(x)f(x)dx (6.11) 

because it considerably reduces the visual complexity of the expressions involved, (2.) explicitly denote the iterative 
nature of the approach by indexing the variational distributions q(ji) and q{X) as q^(fi) and q( l \A), which also 
stresses the fact that in (6.9) and (6.10) above, the left hand side variational distribution refers to the (i + l)th 
iteration, while the right hand side variational distribution refers to the ith iteration, and (3.), because we are 
dealing with probability density functions of the exponential family, to log transform the expressions above to 
simplify proceedings. Upon these notational changes, (6.9) and (6.10) can be re-expressed as, in expectation- 
maximization algorithm terms as 

E Step 

lnq^Gu) = (p(y,n,X)) q m w + C M (6.12) 

MStep 

\nq( i+1 \A) := <p(y,H,X)) q w w + C (6.13) 

In the following two sections, we will derive explicit expressions for the variational parameters m^ +1 \ s^ l+1 ^ and 

a% +1 \b^ +r> that govern the variational distributions q^ l+1 \fi) and q^ l+1 \X), respectively. These expressions are 

given in terms of the observed data, as well as the preceeding variational parameters m^p,s^ l \a^ and as will 
be seen below. 
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6.3 Variational parameter update-equation derivation for the E Step 

As noted above, for the E-Step, the variational inference theorem for mean-field approximations states that 
the logarithm of the optimal distribution over the unobserved variable [l, lnq( l+1 ^(ji), given a fixed distribution over 
the parameter variable, q^{X), is given by 

\nq( i+1 \ii) = (p(y,n,X)) q m w + C M (6.14) 

which, if we identify the prior distribution p(fi)p(A) with the initial variational distribution q^(jj.)q^Q) may be 
rewritten as 

Inq^Kp.) = -\{X) q m w Z^=i(yn - M) 2 - + C„ (6.15) 

2s ^ 

Based on (6.15) and the completing-the-square theorem (Supplement Section 2), one can infer that the optimal 
variational distribution over the latent variable [i is proportional to a Gaussian 

q^MKN(vmj; +1 \sf +1) ) (6.16) 

where the updated variational parameters are provided in terms of the data y\,...,y^ and the variational 
parameters of q^(fi) and q^{X) as 

« + l) ._ mf+^a^TS^yn ,(/+!) ._ 

^ - i + Ns>«aM - l + <a? t «) (6 " 17) 

o Derivation of equation (6.15) 

For a fixed distribution qW(A), we have due to the i.i.d. data assumption and the assumption of factorized priors for 

y = Vim 

lnqC+^Oi) = {p(y,H,X)) q m w + C„ (6.18) 

= (ln(rin=lP(yn<M^))>q(0 ( A) + C M 

= On(rin=lP(ynlM^)p(M^))>q(i)(A) + C M 

= <ln(n^iP(ynlM^)p(M)p(A))> q (0 (A) + C M 

Setting 

pOx)p(A) = <7 (0 G0q (0 (A) for i = (6.19) 
i.e. initializing the variational distribution with the prior distribution, we have for iterationsr i = 0, 1,2, .... 

lnq^(pi) = (ln(n^iP(ynlM^)? (0 (M)? (0 W)> q « (A) + C„ (6.20) 
= (ln(m=iP(ynlM,A))) q (0 W + (toq®(n)) q m w + (lnq^(.X)) q m w + 
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Substituting the example specific functional forms of p(y n \fi,A),q^(fi) and q^ l \X) from (6.4) and (6.6) above, 
namely 

i 

P(yJM) == (4)' ex p("i On -^ 2 ) ( 6 - 21 ) 

q©G0 == (27r S f y^exp(-^(/,-m«) 2 ) (6.22) 



and 



in (6.20) then yields 



ln^ +1 )(x) = <in(n£U(£) ? exp(-f (y n " M) 2 ))) q (0 (A) (6.24) 
+ (In ((27r S 2 (l) ) 2 exp f" ^® (m - rnf) Jj ) q m w 

+ <ln fe^ Aai0 " leXP ("^))^« + ^ 



= (-lnA--ln27r--2n=i(y n -^) 2 >q(0 W) 
+ <--ln(27r) - -ln^ - J ) qmw 



+ 



(- In (r (a®)) - ln(bf )4° + (a® - l) In A - ^)> £?(0 (A) + 



Grouping all constant terms independent of [x (i.e. those terms that do not change if /i changes) with the constant 
and evaluation of the expectation then simplifies the above to 

lnqC^GO = -^=i(y„ -M) 2 Vo tt) - O^rVw + C m (6-25) 

= - i w E£ =1 (y n - m) 2 - + c„ 



o Derivation of equations (6.16) and (6.17) 

We first note that the expectation of A under q W (^) ma y be expressed in terms of the variational parameters as 
and the term Ym=iiyn ~ /-0 2 mav De rewritten as 

Yn=i(.y n - m) 2 = T.n=i(yZ - ^ + m 2 ) = ZnUy? - 2^ =1 y n + (6.27) 

By reordering (6.15) in terms of powers of [i we thus obtain 



39 



InqC'^OO = -afbfVUyt ~ 2^=^ + ^ 2 ) - ^"W) + C|t (6.28) 

2Sn 



-_„Wl(0 V IV 2 , ..JOJOn yiV _ n (.0 h (Q Nll 2 t ^ m M , V ft J i r 



.2 k>) 



= -Nafbff - -^f + (2a«b« Y, N n=1 y n )» - ^ + C M 



1 { i +N sl W a«> b V \ 2 ( (0 (0 N m®\ 
= _i V V J V Sn=iyn - ^jj/i + C M 

Using the completing-the-square theorem in the form 

exp (- iax 2 - to) = N(x; oCH, a -1 ) • C (6.29) 



then yields 



q C £+1 )( M )ocN( M ;m^ +1) ,52 a+1) ) (6.30) 



where the parameters m^ +1 ^ and s^ t+1 ^may be expressed in terms of the ith variational parameters as 
and 



c ^'"' ft "A "A \ _ t!t IR^'W 

S V ~\ „2® | _ , , „„ 2 (0„(i).(i) l ^ 1 ' 



? (0 / (0 

.(<+!)■_ s £ I MuWvN „ _ m 



6.4 Variational parameter update-equation derivation for the E Step 

For the M-Step, the variational inference theorem for approximated posteriors (6.13) states that the 
logarithm of the optimal distribution over the unobserved variable A, given a momentarily fixed distribution over the 
unobserved variable [i variable, is obtained by setting (with a constant C x e M) 

lnq( t+1 )(A) == <p(y,/U)Vi+i) (M ) + Q (6.33) 

which, if we identify the prior distribution p(x)p(A) with the initial variational distribution q(°\x)q(°\X), may be 
rewritten as 

In q( i+1 \A) = \ In A - \ <l£ =1 (y n - n)\^ w + (4° - l) In A - ^ + C A (6.34) 
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Further, upon evaluation of the expectation and taking the exponential of (6.13), it follows that q( l+1 \X) is 
proportional to a Gamma distribution 

G(A;a£ +1) ,^ +1) ) (6.35) 

with parameters 

> == N - + af and == + \ (l£ =1 y 2 ~ 2 y n ^ +N ((m^f + s* a+1) ))) ' (6.36) 
o Derivation of equation (6.34) 

In analogy to the derivation of equations (6.22) and (6.23) we have 

ln^ +1 )(A) = (ln^=i(4)' ex p(-l(yn-M) 2 ))>^ +1 ) M (6.37) 
+ (In {(2nsf +1) ) 2 exp ^-—1^ - m£ +1) ) 2 ^> q(i+1)0i) 



= (fin A - iln 2tt - ^Zn=iOn - V) 2 ) q a+D (ll) 
+ (--ln(»--ln S/ 2 ~ n „ 2 « + i) >,(^o at) 



,(0 



(- in (r (a®)) - in (bff + (af - l) In A - + Q 



+ ■ 

Grouping all constant terms independent of A (i.e. those terms that do not change if A changes) with the constant 
Cx then simplifies the above to 

In q^\X) = ( fin A - ^ =1 (y n - n) 2 ) q a^ (ll) + ((af - l) In A - ^>,« + o w + C x (6.38) 
Evaluation of the expectation of pL under q^ l+1 ^(ji) then yields 

l nq a+D(A) = fin A - f<^ =1 (y n - *) V +1 >00 + (af - l)lnA - ^ + C x (6.39) 

□ 

o Derivation of equations (6.35) and (6.36) 

Summarizing the right hand side of equation (6.34) in terms of logarithms of A yields 

In q^\X) = (f + af - l) In A - -\(Y^ =1 {y n - fOV +1) Go) A + C * < 6 - 40 > 

= (f + af - l) In A - - i(Z^iyn 2 - 2nT,% =1 y n + Nn 2 ) q « + i) Ql) y+ Cx 
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X 

(6.41) 



= g + - i)inA- ^-i(z^iyn 2 -2Z^iyn(MV + i) (M) + N{fi 2 ) q a^ (ll) )y + c l 

where the expectations may be expressed in terms of the variational parameters of q^ l+1 \fi) according to 
lnqC'^CA) = g + af - l) In A - - 2^ =1 y n mf + JV ((m£ +1 >) 2 + +1) )))a + C, 

Taking the exponential on both sides then yields 

Up to a normalization constant Q, <7^ +1 ^(A) is thus given by a Gamma distribution with 

q (i + i)« A 4 +1) exp (-J-\ (6.43) 



(6.42) 



where 



and 



of =?+«?> (6.44) 



'I'"' = + i (S-> " 2 2f.i V, m« + » + JV ((mf «) 2 + % 2( ' +1> ))) 1 (6 ' 45) 



6.5 Evaluation of the variational free energy 

While the maximization of the variational free energy has been exploited to derive the variational parameter update 
equations, the value of the variational free energy has not directly been evaluated. To obtain a value of this (implicit) 
target function to monitor the evolution of the algorithm, and, more importantly, to obtain an approximation to the 
marginal likelihood upon VB-EM convergence, it is desirable to evaluate the variational free energy integral (cf. 
equation (13) of the main tutorial). We first reformulate the variational free energy as follows (W. D. Penny, 2000) 

= I ^ (^) dd - J q(d) In gg) dd == L m {p{y\d),q{fi)) ~ KL{q{d)\\p(jd)) (6-46) 

The first term in the expression on the right hand side of (6.46) is often referred to as the "average energy" and the 
second term is the KL-divergence between the variational and prior distributions. 

o Derivation of (6.46) 

By definition of the variational free energy, we have 

= /?(#) I" (^£r)<W (647) 
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= C av (p(yWXqW)-XC(q(mpm) 

□ 

The "average energy term " £ av may further be reformulated as follows 

£ m (p(y\nq&)) = I q&) lnp(y|0) dd + HCqtf)) (6.48) 

o Verification of (6.48) 

By definition of the average energy and entropy, we have 

L m = Sqmin(^)dB (6.49) 

= f q(fi) lnp(y|0) dd - J q&) In q(d) dd 
= jq(d)\np(y\d)dd+K(q(d)) 

□ 

Upon this reformulation the respective terms comprising the variational free energy, 

^009)) = Jq&)lnp(y\d)dd + H(q(d))-X£(q(d)\\P&)) (6-50) 
may then be evaluated in turn as shown below, resulting in 

(y,N,mf,sl®,af,b®) := L av - KL(q®(n)\\pQi)) - KL(q®(X)\\pW) (6-51) 

where 

L av = (4°) + In (ft®)) - |a®ftj° (Z^ =1 yn 2 + * ((m®) 2 + s*®) - 2m® E£ =1 y n ) (6.52) 

We thus obtainthus obtains an expression for the variational free energy as function of the observed data y 1( ... , y n , 
the variational parameters m®,s^®, a®, Zj® and the prior variational parameter values b^°\ 

o Verification of (6.50) 

We first recall the definitions of q(y |i9), p(i9) and q(i9) for the current example: 

P(yh?) = pOI/U) = nSU* CMM) (6.53) 

p(0) = pOx.A) = p(x)p(T) = iV(/i;m M ,s2)G(A;a A ,^) (6.54) 

and 
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= q(jx,X) = q(ji)qW = N [fx; mf , s 2& ) G(A; af, fcf ) (6.55) 

(a) Evaluation of f q(ti) \np(y\ti) dd in (6.50) 
Substitution of (6.53) in (6.50) yields 

tfq(ji)qW 1" (ilSU N (y n \n, A)) dfidA = ff q(ji)qW In {(£f n£=i exp (- \ (y n - tffj dfidA (6.56) 

= ffq(n)qCX) gin A - \ln 2n - ^ =1 (y n - fi) 2 ) dfidA 
= \55 90*)?U) On A) d/idA - J/ q(/i)q(A) (^Zn=i(yn - fi) 2 )dfidA - ^ln 2tt 

= i J q(A) In A dA - J q(A) ^ (J gOOG^Cy* ~ i") 2 ) dfO dA-\\n In 

The first integral term above is the expectation of a logarithm variable A under the variational Gamma distribution 
G [A; which can be shown to evaluate to 

f q{A) In A dA = xp (b®) + In af (6.57) 

where 

4>(x) •■= ^ T(x) (6.58) 

denotes the di-gamma function. The verification of (6.57) is left as an exercise to the interested reader. The second 
integral term in (6.56) evaluates to 

I qW \X(J q^)dLi(.yn ~ M) 2 ) dn) dA = j q{A) Qa) df q(jJ.)(I,% =1 y 2 - 2pi l£ =1 y n + Npi 2 ) dpi) dA (6.59) 

= iJqW AGiLiyn 2 - 2 n=iy n J + Nf q{n)n 2 dn) dA 

= \afb® (x N n=1 y 2 - 2mf Y% =1 y n + N (( M ®) 2 + sf^ 

(b) Evaluation ofH(q(d)) in (6.50) 

It is a well-known result from information theory that the joint entropy of statistically independent random variables 
is additive (Cover, 1991). We thus have 

X(q&)) = Jf (qOx)q(A)) = X(qQi)) + Jf (q(A)) (6.60) 
The entropy of the variable \i corresponds to the differential entropy of the variational Gaussian distribution 

qWQi) = N(n;m®,s* i0 ) (6.61) 

and is thus given by (see e.g. Bishop) 
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Jf(qGO) =h(n (fi;m®,s* i0 y) = ^nsf + \{1 + In 2tt) (6.62) 

The entropy of the variable A corresponds to the differential entropy of the variational Gamma distribution 
and is thus given by (see e.g. Bishop) 

H(qVL)) = X(C (A; af , if)) = if + In af + In (r (if)) + (l - if) ^ (if) (6.63) 



CcJ Evaluation ofXL(q{d) | |p(i9)) f/j (6.50; 



We first note that we have 



KL(q(n)q(X)\\p(n)p(X)) = K£CqQi)\\pQi)) + XL{q{A)\\p{A)) (6.64) 

because 

XUqQijqW I IpCOP W) = / qCpM-O In (ggg) * <U (6.65) 

= /*WA)(ln^ + l„g)dMA 

-/,o.)(m^)*. + ;,w(in^)« 

= JC£( t )(„)||p(„)) + 3£X(<((A)||p(A)) 

In the current example, the variable [L is governed by a Gaussian distribution. The prior distribution p(ji) 
corresponds to the initial variational q^Qx), while the variational distribution q(ji) corresponds to the i variational 
distribution over fi. Hence, we have 

X£(qQi) | IpGO) = XL (q® GO \ \q^Qi)) = XL (n (ji; mf , sf) \ \N ( M ; m<?\ sf^ (6.66) 

As noted in [Penny [xXXRef]], the KL divergence for two Gaussian distributions is a function of the respective 
distribution parameters and given for the current example as 



XL ( N ( M ; mf, sf>) | \N ( M ; mf , ^ (0) )) = 



2(0) 

l 

7w 



+ ■ 



+m;/ +s 



2s: 



,(0) 



(6.67) 



Finally, for the current example, the parameter variable corresponds to A as is governed by a Gamma distribution. 
The prior distribution p(A) corresponds to the initial variational distribution q(°\A) while the posterior variational 
distribution corresponds to the ith variational distribution over A. 

XL{q{A)\\p{A)) = X£(q^a)\\q(-°\X)) = Xl(g (l;af ,if ) \\G (k;af\bf)) (6.68) 

As noted in (Penny, 2001), the KL divergence for two Gamma distributions is a function of the respective Gamma 
distribution parameters and given for the current example as 



^( G (A;af,if)|| G (A;af,if)) 



(6.69) 
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= - 1)*(6») - Inaf - fcf - lo g r(if ) + lnr(fcf) + fcf Inaf - (fcf - l)(*(ft«>) + lnaf ) + ^ 



Concatenating the results from (a), (b) and (c) then results in expressions (6.51) for the variational free energy. 
The VB algorithm derived in this example is shown in pseudocode form in Table 3. 



□ 



% Initialization of the variational parameters and variational free energy 



(o) o 
:= ml 

-2(«» ._ „2° 

•v — m 

4 0) == 4 
*f == bl 



% evaluation of the initial variational free energy 

T<® == T {y,N, m f\sf\ m f\sf\af,bf,ar,bf) 

%VB iterations 

for i = 0,1, 2, ...until convergence do 
% E-Step 

(i + 1) _ m M +-V a X °X Ln=iyn 

2 (t+l) 4 (0 



% M-Step 



a X 2 A 



== (^+i(z^yn 2 -2i- =1 y n mr ) +^(« +i) ) 2 



-i 



% Free Energy Evaluation 

?™ == F (y, N, mS +1 \ si mf , ^ (0) , b™. af\ if) 



end 



% VB posterior unobserved variable distribution and log model evidence approximation 

P(/U|y) w == * ( W m< i+1 • C (A; 4 +1) ^ +1) ) 

lnp(y) VB :=T^ 



Table 3. Pseudocode for the VB Algorithm for the univariate Gaussian. 
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7 The Kalman-Rauch-Tung-Striebel Smoothing Algorithm 

For LGSSMs, the evaluation of conditional distributions over the latent variables x 1:T given an observed data 
sequence y{. T and a known parameter 8® corresponds to the "state space model inference" problem for which a 
variety of solutions exist in the literature (for a comprehensive review see e.g. (Briers, Doucet, & Maskell, 2010)). The 
most popular solution is perhaps the KRTS smoother, which represents the combination of the classic Kalman Filter 
for state space models (Kalman, 1960) with a backward algorithm (Johari, Desabrais, Rauch, Striebel, & Tung, 1965). 
In brief, the KRTS smoother is a recursive algorithm that yields the expectation and covariance parameters of the 
conditional marginal latent variable distributions in matrix product form. 

Before proceeding, it is helpful to differentiate the terms "filtering" and "smoothing". As stated above, the 
aim of inference for LGSSMs is to derive probabilistic conclusions about the states of a hidden variable x 1:T given a 
realization sequence of the observed variables set y 1:T = y{. T . For each x t , t = 1, ...,T, these conclusions can in 
principle be derived from the joint distribution of the LGSSM p(x 1:Tl y 1:T ) as specified in equation (9) of the main 
tutorial by appropriately conditioning on the remaining variables. Depending on whether in the conditional marginal 
distribution p(x t |y 1:fe ) the value of k is smaller than, equal to, or larger than the value of t, inference of the 
distribution p(x t |y 1:fe ) takes different forms and is labeled differently. Specifically, for k < t, inferring p(x t |y 1:fe ) is 
called "prediction," and will not be considered here. For k = t, inferring p(x t \y 1:t ) is referred to as "filtering." As will 
be seen below, filtering also forms the basis for the case of k > t. Inferring p(x t |y 1:T ) is known as (fixed interval) 
"smoothing." Importantly, in addition to the information available from observing y 1:t as in filtering, smoothing also 
takes into account the evolution of the observations after x t obtained a specific hidden state. Intuitively, the 
smoothed conditional marginal distribution p(x t |y 1:T ) is hence more veridical than the filtered conditional marginal 
distribution p(x t |y 1:t ). 

In the following discussion, we omit the superscript (i) from the parameter 0® for notational brevity, 
keeping in mind that in the context of the exact VML-EM the parameter 8 correspond to the parameter estimate at a 
given iteration of the algorithm. 

Denoting the dimensionality of the latent variable by k G M and the dimensionality of the observed variable 
by p G N, expressions (5) of the main tutorial is written more generally as 

x t = Ax t _ x +£ t (x t eR k ,Ae R kxk ,£ t ~N(£ t ;0,Z x ),0(E R k ,I, x G R kxk ,Z x > 0) (7.1) 

y t = Bx t + r} t (y t G R p , B G E pxk , rj t ~ N(j] t ; 0, I y ), G R p , I y G E pxp , I y > 0) (7.2) 

for t = 2, ...,T and with the assumption of independent dynamics and observation noise, that is, C(£ t ,r] t ) = G 
R kxp . It is central for the understanding of LGSSMs that these two equations specify a multivariate joint Gaussian 
distribution over the latent variables x 2 , ...,x T and the observed variables and y 2 , ...,y-r- Specifically, due to the 
Gaussian properties of e t and rj t , equations (71) and (7.2) specify the following conditional probability distributions 
over state and observations variables x t and y t , respectively 

P(*t l*t-i) = N(x t ; Ax t _ lt T. x ) and p(y t \x t ) = N(y t ; Bx t , S y ) , t = 2, ... , T (7.3) 

The specification of the joint distribution for the LGSSM is completed by a further Gaussian distribution over x x with 
expectation parameter n x G R k and covariance S x G E kxk ,I x > 0, 

pO x ) = N( Xl ; (7.4) 

and the identification of 

p(yil*i) ~N( yi ; Bx^y) (7.5) 
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Using the "colon notation" abbreviations 

x 1:T : = (x lt ...,x T ) andy 1:T : = (yi,...,y T ) 



(7.6) 



then allows for writing the joint distribution over latent and observed variables as 

Pe(x 1:T ,y 1:T ) = P0O1) Uj= 2 Pe(*tl*t-i) FlLi PeOtkt) ( 7 - 7 ) 

Here, the subscript 8 refers to the parameter 4 of this distribution given by 8 •■= fji 1 ,'£ 1 ,A l I y }. As discussed in 
the tutorial, expression (5) states that the joint distribution over all variables x 1:T and y 1:T of the LGSSM is given by 
the product of Gaussian marginal and conditional distributions over x x and x 2: T,yi:T> respectively. Because the 
product of Gaussian probability densities is again a Gaussian probability densitiy, (7.7) hence specifies a Gaussian 
joint distribution by means of its factorization properties. 

Kalman Filtering - Inferring p g (x t |y 1:t ) 

Inferring a probability distribution over the hidden variable x t given all observations up to y t , i.e., Pg(x t \y 1:t ) is 
performed readily due to the analytical properties of Gaussians distributions. As the joint distribution over x 1:T and 
y 1:T is Gaussian, all marginal and conditional marginal distributions are also Gaussian distributions. Further, as 
Gaussian distributions are characterized by their first two central moments, i.e., their expectation and covariance 
parameters, it is only these parameters that have to be inferred to fully characterize the filter distribution of interest 

Pe(*tlyi:t) = N (. x t m >Hx t \y 1:t >X Xt \ yi . t ) (7.8) 

The aim of the following section is to find a recursive expression for the conditional expectation and covariance 
parameters \*- Xt \y x . t e ^ k ar| d ^x t \y vt e ^ kxk , ^x^y^t > of (7.8) in terms of the parameters of the "preceding 
distribution" 

Pe(*t-llyi:t-l) = N (. x t-l^x t _ 1 \y 1 .. t ^x t _ 1 \y 1 .. t _ 1 ) (7-9) 
Formally, this can be achieved in two steps 5 : first, by finding the parameters of 

Pe(*t,y t |yi:t-i) = N ((^;^rlyi:t-i' Z Wtlyi:t-i) (7-10) 

in terms of the parameters of the distribution Pe(x t _ 1 ,y t _ 1 |y 1:t _ 1 ), and second, by conditioning Pe(.x t ,yt\yi-.t-i) 
on y t , yielding the parameters of the distribution of interest (7.8). The first step can be achieved by considering the 
partitioning of the expectation and covariance parameters of p g (x t , y t \y 1:t -i), which are given by 



and 



_ (€(x t ,x t \y 1 , t _ 1 ) < ^(x t ,y t \y 1 , t _ 1 )\ , k+p ^ k+p) 

Xt ' yt|yi:t " 1 " U(y t ,* t |yi :t -i) cfoyttott-iV 1 ] 

Recursive expressions of (7.11) and (7.12) in terms of the parameters of Pe(x t _ 1 ,y t _ 1 |y 1:t _ 1 ) can be obtained by 
means of the linear transformation theorem for Gaussian distributions, yielding (see below for details) 



4 Although 9 obviously represents a "set of parameters", or, if the elements of this were concatenated into a column vector, a 
"parameter vector", we refer to it simply as "parameter". This practice eases the linguistic transition between models 
comprising a single parameter and those comprising multiple parameters. 
5 In the derivation of the Kalman-filter equations, we follow (David Barber, 2012) 
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^x t ,y t \y 1:t . 



lyi=t-i \ e M fe +P 

llyi:t-l' 



(7.13) 



and 



B T (AX Xt _ ilyi[t _A T + Z x ) 



^ t ,y t \yi:t-i 



A^x^y^A 7 + Bi^x^y^A 7 + Z X )B T + Z y , 



£ ]g(fe+p)x(/C+p) 



(7.14) 



o Derivation of equations (7.13) and (7.14) 

The aim of this section is to show how recursive expressions for the parameters of the joint Gaussian 
distribution P0{x t ,y t \y 1 . t - 1 ) (denoted by M% t ,y t |y lt _ 1 and ^ Xt ,y t \y- L . t -- i ) can De obtained in terms of the parameters of 
the "preceding" joint distribution Pe(*t-i>yt-ilyi:t-i) (denoted by i u Xt _ 1 , yt _ 1 |y 1:t _ 1 and ^x^.y^y^)- In the 
main text, it was noted that the parameters H Xt ,y t \y 1 . t - 1 ar| d ^x t ,y t |yit-i P ar "tition according to 

H X y \y = ) = & lyi:t -\ } ) £ (7.15) 

and 

/ 2 Ar t ,% t |yi :t _i 2 wtlyi:t-i\ /£(x t ,x t |yi :t _i) CC^ytlyiit-i)^ c ns(fc+p)x(fc+p) / 7 i 7 \ 
2 yt , ydyi:t _J Utycxtly^t-O ^(yt»ytlyi:t-i)/ ' 1 ' 

Based on the Gaussian linear transformation theorem, it is acknowledged, that if Ve(x t _ 1 ,y t _ 1 \y 1 . t _ 1 ), then 
p (x t ,y t |y 1:t _ 1 ) is a Gaussian distribution as well, and inference on the distribution Ve( x t-i>yt-i\yv.t-i) ]S 
achieved by inference on its expectation and covariance parameter components. 

The first component of V-x t ,y t \y\ t-x re P resen ts the expectation of the random vector x t conditioned on the 
observations y\ X -i- Based on the dynamics equation of the LGSSM it is thus obtained by a linear transformation of 
the expectation of the random vector x t -\ conditioned on the observations y 1:t -i under the transition matrix A and 
additive zero-mean noise s t , in other words 

^tlyi:t-i = E (*tlyi:t-i) (7-16) 
= E(Ax t - ± + £ t lyi:t-i) 

= E(ilX t _i|yi: t _i) 



- A V-x t - ± \y x . x - x 

Likewise, the expectation of the observation y t conditioned on the observations y^-t-i iS given by the expectation of 
the linear transformation of x t under the emission matrix B and additive zero-mean noise r\ t ; and further, x t is itself 
given by the linear transformation of x t _ x under A, andd additive zero-mean noise s t in other words 

% t iyi:t-i = E (ytlyi:t-i) ( 7 - 17 ) 

= E(.Bx t + l?tlyi:t-l) 

= E(BAx t _ 1 + s t \y 1:t _ 1 ) 

= BAE(x t _ 1 \y 1:t _ 1 ) 
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- BA t i x t . 1 \y 1 ., t . 1 

Concatenation then yields the desired recursive expression for the expectation parameter of p e (x t , y t |yi :t -i) 



^ t ,y t \y, t -, ~ (bIT^Z-) (7 ' 18) 



The first component of ^x t ,y t \y- L . t - 1 represents the covariance of the random vector x t with itself conditioned on 
y-L-.t-i' ^i x t> x t\yi:t-i)- As x t ]S given as the linear transformation of x t _ x under additive zero mean noise s t with 
covariance matrix T. XI the Gaussian linear transformation theorem (conditioned on yi :t _i) applies. The expression 
for the covariance of x t with x t in terms of the covariance of x t _ x with x t _ x is thus given by 

C(* t ,* t |y 1:t _!) = ^x^y^.A 7 + 2* (7.19) 
Evaluating the covariance of y t with y t in terms of the covariance of y t _ x with y t _ x yields 

c(y t , yt lyi:t-i) = £(Bx t + r] t , Bx t + m |y 1:t _ x ) (7.20) 

= C(Bx t .Bxtly^t^) + I y 

= S(C(x t ,x t |y 1:t _ 1 )S T + I y 

= B(AL Xt _ x ]yi . t _ 1 A T + Z X )B T + I y 
Finally, evaluation of the covariance of x t with y t n terms of the covariance of x t _ x with y t _ x yields 

C(*t. yt |yi:t-i) = CU*t-i + e t - 5(^t-i + e t ) + r)t \yi-.t-i) (7.21) 
= {AZ Xt _ l]yi . t _ 1 A T + Z x )B T 

and hence also 

T 

C(y»x t \yi:t-i) = (C(x t ,y t |y 1:t _!)) (7.22) 

= BT (^ t _ 1 |y 1:t _ 1 ^+^) 
= (^ t _ 1 , yi:t _ 1 ^ + I x )B 

Concatenation then yields the desired recursive expression for the covariance parameter of Vei x t>yt\yi:t-i)'- 

\fi(^x M lyi:^ T + S *) B (AZ Xt _ ilyi . t _A T + Z x )B T + Z y ) 

□ 

The second step can be achieved by capitalizing on the Gaussian conditioning theorem, resulting in 

= + { A *x t -,\y, t -, AT + ^x)B T (B(AI Xt _ i]yi . t _A T + I X )B T + I y ) _1 (y t - BA^^) 
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(7.24) 



and 

* Xt \y,t = (^x^ly^A 7 + 2,) - (A^y^A 7 + ^{B^A^y^ + Z X )B T + S y ) _1 B^AY.^^ + X x f 

(7.25) 

o Derivation of equations (7.24) and (7.25) 

Based on the recursive expressions of the parameters of p e (x t ,y t \y 1 . t _ 1 ) in terms of the parameters of 
Pe (. x t-i>yt-i\yix-i)> tne recursive expressions for the parameters of p e (x t \y 1:t ) by means of applying the 
Gaussian conditioning theorem. Specifically, by setting 



z:=Q t ),x t eRP,y t eR k (7.26) 



and using 

POtbt) = N i x >i J -x t \y t ^x t \y t ) (7-29) 

where 

Mx t |y t = Mx t + Zx t y t Zyty t (yt — My t ) and 2 %t | yt = Z XtXt — Zx t y t Zyty t ^ytXt ( 7 - 27 ) 
yields by substitution of the respective expresssions (conditioned on y± : t-i)' 

Hx t \y 1:t = AUx^m + i^x^y^A 7 + * X )B T {B(AY. Xt _^. t _^ + I X )B T + I y ) _1 (y t - BAn Xt _ llyi ,_J 

(7.28) 

and 

= [A'Zx t - 1 \y 1:t - 1 A T + 2,) - {{AY- Xt _^,_^ + Z X ) B T (B (AZ^^A? + Z X )B T + ty)' 1 B^A^y^A? + E x )) 

(7.29) 

□ 

The expressions for fe t |y l t and ^x t |yi t above can be formulated more succinctly by setting 

P := AZ^y^A 7 + Z X E R kxk (7.30) 

yielding 

fed** = ^t-ilyi=t-i + PB T {BPB T + Z y y\y t - BA^^) (7.31) 

and 

2 * t |y 1:t = P - PB T (BPB T + I y ) _1 SP (7.32) 
Finally, by defining the so-called Kalman Gain Matrix 



K := PB T {BPB T + I y ) G M kxp (7.33) 
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the parameters may equivalently be expressed as 

Hx t \y 1:t = APx^y^ + ~ BA Hx t -^.. t -d ( 7 - 34 ) 

and 

S *tlyi:t =P-KBP = Q- KB)P (7.35) 

To avoid that the inferred covariance £x t |y l t , which is given by the difference of two positive-definite 
matrices in (7.35) and, hence, is itself positive-definite, becomes nonpositive-definite due to numerical rounding 
errors, Bucy and Joseph (Bucy & Joseph, 1987) proposed the following modified update rule, which includes the 
formation of a positive-definite matrix on every inference step, which is derived below: 

z * t |yi:t = ( 7 ~ KB ^> P = V~ KB ^> p ~ KB y + K ^y RT ( 7 - 36 ) 

o Derivation of equation (7.36) 

We show the equivalence of the expressions 

(/ - KB)P(I - KB) T + Kl y K T and (/ - KB)P (7.37) 

for the filtered covariance parameter 2x t |y l t / where the latter representation has the benefit of involving the 
computation of a positive definit matrix at every recursion step. 

(/ - KB)P{1 - KB) T + KZ y K T = (I - KB)P{I - B T K T ) + KI y K T (7.38) 



= (/ 


-KB)P 


- (/ - KB)PB T K T + KIy K T 


= (/ 


-KB)P 


- PB T K T + KBPB T K T + KIy K T 


= (/ 


-KB)P 


- PB T K T + K{BPB T + I y )K T 


= (/ 


-KB)P 


- PB T K T + PB T {BPB T + I y ) _1 (BPS T + I y )K T 


= (/ 


-KB)P 


- PB T K T + PB T K T 


= (/ 


-KB)P 





□ 

In summary, based on the initialization 

Mxdyi : = /4° and I Xllyi := (7.39) 

the parameters of the distributions p#(x t |y 1:t ) can successively be obtained by using equations (7.24) and (7.25) on 
the ith iteration of the exact VML-EM E-Step. However, as discussed in the main tutorial, the exact VML-EM requires 
conditional marginal distributions of the form Pe(x t |y 1:T ). Parameters of these distributions can be obtained by 
augmenting the Kalman filter recursions with a backward recursions starting from Pe(x T \y T ) as discussed next. 

KRTS Smoothing - Inferring p g (x t \y 1:T ) 

Like the conditional marginal distribution Pe(x t |y 1:t ) the conditional marginal Pe(x t |y 1:T ) of an LGSSM is a 
normal distribution and, hence, can be characterized by its first two central moments Mx t |y 1T an d ^x t |y 1T - ln 
analogy with (7.9) and (7.10), the aim of the following section is to derive the parameters of 

PeOtlyi:r) = N i x f^x t \y v . T ^x t \y l .. T ) (7-40) 
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in terms of the parameters of the "succeeding distribution" 

PeOt+ilyi:r) = N (. x t+i;Hx t+1 \y 1 .. T ,Z Xt+1 \y 1 . T ) (7.41) 

A backward recursion starting from the final result of the Kalman Filter recursion Pe(x T |y 1:T ) may be derived by 
noting that for the LGSSM the joint distribution over temporally adjacent latent variables conditioned on y 1:T may be 
written as (see below for details). In the derivation of the KRTS update equations, we follow (Yu, Shenoy, & Sahani, 
2004). 



„ ( v v U, A _ Pe(x t+ l\Xt)Pe(Xt\yi:t)Pe(Xt + l\yi:T) n 

Pe^ t ,x t+1 \y 1:T ) - Pe( , t+1 | yi:t) < 7 - 42 ) 



o Derivation of equation (7.42) 

We first note that 

Pe Ot< x t+1 \y 1:T ) = p e (x t \x t+1 , y^ve Ot+i \Vi-.t) (7.43) 

We next note that x t is independent of y t+1 given x t+1 , i.e. given x t+1 , x t and y t+1 are conditionally independent, 
and one may write 

Pei.x t \x t+1 ,y 1 , T ) = p (x t |x t+1 ,y 1:t ) (7.44) 

The conditional independence of x t and y t +v.T given x t+1 may be derived by the using the d-separation criterion in 
the graphical model framework (p. 378 in (Bishop, 2007)): because every path from x t to y t+1 . T is blocked by the 
node x t+1 , i.e. the arrows on the respective paths meet head-to-tail at the node x t+1 (see Figure 3A of the main 
tutorial), x t is conditional independent of y t+1 . T given x t+1 . In other words 

x t 1 y k+1 \x t+1 for k = t, t + 1, ... , T - 1 (7.45) 

We hence have 

Pe(*t<*t+ilyi:r) = Pe(*tl*t+i<yi:t)Pe(*t+ilyi:r) (7-46) 

□ 

Considering the first term on the right-hand side of the above, we have, again with the factorization properties of 
the LGSSM 

_ Pe(*tlyi:t)Pe(*t+il*t.yi:t) 

Pe(x t +i\yi:t) 
_ Peixtly^pgjxt+^xt) 

Pe(x t +i\yi:t) 

where the conditional independence of x t+1 and y vt given x t follows from 

Pe(x t+ i,yiM = P9{x ;2l^ (7-48) 

_ Pe(xt)Pe(xt+i\xt)Pe(yi:t-i,yt\x t ) 
Pe(xt) 

= Pe(^+ikt)Pe(yi:tl^) 
Taking logarithms and substituting for the distributions on the right hand side of (7.48) then results in 

In p e (x t , x t+1 \y 1:T ) = - \ (x t+1 - AxtfT^ (x t+1 - Ax t ) (7.49) 
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~\{ x t -i"x t |y 1:t ) ( S x t |y 1:t ) ( x t ~ Mx t |y 1:t ) 

_ 2 ( x t+l _ i"x t+1 |y 1:T ) ( 2 x t+1 |y 1:T ) ( x t - i"x t+1 |y 1:7 .) 

+ 2 ( x t+l ~ ^t+ilyi:t) ( S ^t+ilyi:t) ( x t ~ i"* t+1 |y 1:t ) + C 

where C £ E denotes a a constant independent of x t and x t+1 . Rewriting the above in terms of first and second 
order in x t and x t+1 and comparing these to a standard Gaussian distribution partition allows for inferring the 
entries of the conditional covariance matrix 2 Xtj x t+1 | yi 7 ,. This conditional covariance matrix partitions according to 

x t ,x t+1 \ yi . T {l Xt+iiXt]yiT l Xt+l]yi . T ) \C(.x t ,x t+1 \y 1:T ) C(.x t+1 ,x t+1 \y 1:T )J 

(7.50) 

The diagonal entries of ^x t ,x t+1 |y 1T i n turn allow for inferring the following recursive KRTS update equations (see 
below for details) 

V-x t \ yi . T = Mx t |y 1:t + Zx t \y 1:t AT ( A Z Xt \y 1:t A T + Z x ) {^x t+1 \ yi . T ~ A ^x t \y 1:t ) ( 7 - 51 ) 

and 

Z Xt \y,r = Z xt]yi:t + (z xt]yi ,A T (AZ Xt]yi . t A T + {z Xt+llyi:T ~ ^ Xt]yi . t A T + (z xt]yi . t A T (AZ Xt]yi . t A T + E,)" 1 )" 

(7.52) 

Notably, these recursive backward expressions for the smoothed expectation [i Xt \ yi . T and covariance parameters 
I Zt | yi7 , contain only the filtered conditional expectation li Xt \y vt > the filtered conditional covariance ^ Xt \y vt < the 
transition matrix A, and the state noise covariance I. x . The ensuing algorithm is referred to as KRTS smoother. 

o Derivations of of equations (7.51) and (7.52) 

As outlined above, we derive the KRTS update equations by means of the conditional covariance matrix 
T. Xt jXt+1 \ yi . T following the approach of (Yu, Shenoy, & Sahani, 2004). This approach capitalizes on the following two 
matrix identities 

(A + UCVy 1 = A' 1 - A^UiC' 1 + VA^Uy^A- 1 (7.53) 

and 

(/ -(A + BY 1 A)B- 1 = (A + B)~ 1 (7.54) 

for appropriately specified matrices A,B,C,U and V. (7.53) is known as the "Woodbury identity" or "Matrix 
Inversion Lemma" and its verification is left to the interested reader. (7.54) is verified as 

(A + B^iA + B) = 1 (7.55) 
<=> 04 + By 1 A + (A + By x B = I 

d(A + By x B = I — (A + By 1 A 
<^> (A + By 1 = (/ - (A + By 1 A)B~ 1 
^ (/ - (A + By 1 A)B~ 1 = {A + By 1 



54 



Further, the approach of (Yu, Shenoy, & Sahani, 2004) uses the formulation of the quadratic form in the exponent of 
multivariate Gaussian distributions as introduced in Supplement Section 2 and notated here 6 as 



= — 2 Z 1-^11 Z 1 ~~ 2 Z 1"^ 12Z 2 ~~ 2 Z 2^21 z l — 2 Z 2-^22 Z 2 + z 2(-^2lMl + -^22^22) + C 

for p(z 1 ,z 2 ) = N f( z *) ; (^*) w ' tn su ' ta ' 3 'y cr| osen vectors z 1 ,z 2 ,fi 1 ,fi 2 and block matrix A, denoting the 

inverse of the covariance matrix of p(z 1( Z2) and C £ DL In addition (Yu, Shenoy, & Sahani, 2004) note that for the 
inversion of block matrices, the following equalities hold. Let 



£ 12 \ ._ A 12 \ x ^ ^ 

\^21 ^22/ VA21 A 22 / 



and 



Tn : = A X1 - A 12 A 2 iA2i and T 22 := A 22 - A^A^A^ (7.58) 



Then 



/An a 12 \ 1 / rn 1 -rr^A^A . . 

U21 A 22 J {-A-AAiJR 1 & ) (Ab9) 

= Ml 1 1 +Al 1 1 A 12 r 2 -2 1 A 21 Al 1 1 -^Aii^ii \ 

\ ~ ^22^21^11 A 22 + A 22 A 21 r 11 1 A 12 A 22 / 

Again, the verification of (7.59) is left as an exercise to the interested reader. 

Based on the set of preliminaries (7.54) - (7.59), equations (7.51) and (7.52) may now be derived from (7.49) 
as follows. As a first step, the right hand side of (7.49) is reordered in terms of second-order terms in x t+1 , first-order 
mixed terms in x t+1 and x t , second-oder terms in x t and first-oder terms in x t . This reordering results in 

lnp(x t ,x t+1 |y 1:T ) = -ix t T +1 (i:- 1 -(l Zt+i | yi:t ) 1 + (l Xt+i | yi:T ) )x t+1 (7.60) 

— 2 x t+i ( — 1 A)x t — - xj (— AY. X 1 )x t+1 

+xj (( 2 x t |y 1:t ) ftc t |y l!t ) + C 

Term-by-term comparison of (7.60) and (7.56) then allows for inferring the entries of the inverse covariance matrix 
of p(x t+1 ,x t \y 1:T ) to be given as 



6 In contrast to Supplement Section 2 we here use numerical indices to avoid confusion with the nomenclature for latent and 
observed variables in the LGSSM context. 
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^x t+1 \y 1:T S x t+1 , t |y 1:7 A _ f^x 1 (X t+1 |y 1:t ) + &x t+1 \y 1:T ) ^A 
Z x t , t+1 |y 1: r Z x t |y 1: r / \ -AT' 1 A T T^A + (Z Xt | yi:t ) 



-l 

J 



(7.61) 

To obtain the covariance matrix of p(x t+1 ,x t |y 1:T ), the block matrix on the right-hand side of (7.61) has to be 
inverted. To achieve this, and simultaneously derive (7.52) (Yu, Shenoy, & Sahani, 2004), first simplify A22 and 
A22A21 in (7.59) in terms of the entries in (7.61). Specifically, using the matrix inversion lemma 



A 22 ~ - (A T I. X 1 A + 2 %t |y 1:7 -) - Z x t |y 1:t - S x t |y 1:t ^ T (^ t |y 1:t +l) A ^x t \y vx (7-62) 

and defining 

h ■■= ^ t |y 1:t ^(^ t+1 |y 1:t ) _1 (7.63) 

we obtain 

A22 = £* t |y 1:t -Jt^x t+1 \yjt (7-64) 

Likewise, by means of the matrix identity 

(/ -(A + BY 1 A)B- 1 = G4 + B)- 1 (7.65) 

we obtain 

AiiA 21 = -% t \ yi:t -h*x t+1 \yJ T t)A T ^ = -J t (7.66) 

After this simplification, the block-matrix on the right-hand side of (7.61) may be inverted to obtain the entries of the 
covariance matrix of p(x t+1 ,x t \y 1:T ) according to (7.52). Specifically, we have 



and 



^x t \y 1:T := A 2 2 + A22A2ir il 1 A 12 A22 L (7.68) 

= ( S *t|yi:t ~ Jt^X t+1 \ yi Jt) + (-Jt)Zx t+1 \ yi . T (-Jt) 

= 2 *tlyi:t + Jt(Zx t+1 \y 1:T ~ ^x t+1 \y 1:t )Jt 

= ^x t \y 1:t + {^x t \y 1:t A 7 (X t+1 |y 1:t ) ) &x t+1 \y 1:T ~ S x t+1 |y 1:t ) ( 2 x t |y 1:t ^ T ( S x t+1 |y 1:t ) ) 

Noting that with the Gaussian linear transformation theorem 

^ t |y 1:t - 1 =^ t _ 1 |y 1:t - 1 ^ + ^ (7-69) 

and thus 

*x t+1 \y 1:t =AX Xt \ yi!t A T + l x (7.70) 
we have thus (7.52), i.e. a backward recursive expression for Z Xt \ yi . T i n terms of the filtered covariance matrices 
I Xt | yi t and the parameters of the LGSSM 
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*t+l\yi:T 



Finally, to find the mean, we compare the last terms in (7.56) and (7.60), i.e. 



Apparently, we have 



We hence infer 



Z2(A 21 /i! +A 22 ^ 2 ) and xj (fi Xt \y lt ) V* t |yi :t ) 



A 2li"x t+1 |y 1:7 . + A 22i"x t |y 1:7 . ~ ^Wt^tlyi 



Solving for l*-x t \y x . T > we obtain 



Next, using 



we obtain 



#»:tlyi:r _ A 22 A 21l"x t+1 |y 1:T + A 22 2 x t |y 1:t Mx t |y 1:t 



A 22 1 A 21 = -A and A 22 x = S xdyi:t - J^x^yJ't 



= /tMx t+1 |y 1;7 - + Mx t |y 1:t ~ Jt A Hx t \y 1:t 
= Px t \y 1:t + JtQ*x t+1 \y 1:T ~ A ^x t \y vx ) 

= Vxtfrit + ^ t \y 1:t AT (^x t+1 \y 1:t ) )(Mx t+1 \y 1:T ~ A ^x t \y vx ) 

= ^x t \y 1:t + : x t \y 1 .. t AT { A X Xt \y 1 . t A T + l x ) )(fe t+1 |y 1:T ~ 4"x t |y 1:t ) 



Finally, equations (7.51) and (7.52) may be written more succinctly by defining 

A : = 2 x t |y 1:t ^ T (^x t |y 1:t ^ T + 2x) 

resulting in 

^*tlyi:r = ^tlynt + A(l"x t+1 |y 1:T _ ^i"x t |y 1:t ) 

and 

2 *tlyi:r = 2 ^tlyi:t + ^ ( S ^t+llyi:T ~~ i A ^X t \y 1]t AT + ^x))jt 



This recursion is initialized as 
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* X r-,n yi .. T = ^r\y,rJ T T-i = 0~ ^ XT \y,r-A T + ^)S T (S y + B(^ T _ l|yi:T _/ T + T x )B T y 1 B)AT XT ^ T 

(7.81) 

by substitution of the recursive expressions for 2 X | y and the definition of J T ^ 1 . Notably, like the recursions for 
/**tlyiT and ^ Xt \ yi . T , this backward recursion contains only the filtered conditional covariance ^ Xt \y vt < the transition 
matrix A, and the state noise covariance 2 X . The recursion can be initialized using 



(7.82) 

o 

To complete the current section, we demonstrate how these are obtained. We first note that, in parameterized 
form, the smoothed latent variable first- and second-order moment are given as 

( x t)p e(i) (x t \y 1:T ) = fte t |y 1:r and ( x t x t)p e(i) (x t \y 1:T ) = i"x t |y 1:T i"x t |y 1:7 . + S * t |y 1:T ( 7 " 83 ) 

and the second moment of the pairwise conditional marginal is given as 

( X t-l X t }p e (o(*t-l.*tlyi:70 = ^*t-llyi:-r^tlyi:T + 2 *t-l:t|y 1:T ( 7 " 84 ) 

The first term on the right-hand side of (7.82) capitalizes on the availability of the smoothed latent variable 
expectations. A recursion for the second term of (7.82) can be obtained by considering the off-diagonal elements in 
(5.2.24) yielding (see below for details) 



2 * t _ 1:tb , r - ^x t \yjt ( 7 - 85 ) 



o Derivation of equation (7.80) 

According to 

and with 
we have 



2 *t,t+ilyi : r ~~ "-^^lr/ii 1 (7.86) 



Z * t+1 |y 1: r = T ii and A 22A 2 i = ~Jt (7-87) 



2 *t:t+t|y 1:7 . ~~ ^t+llyi:^ ° 2 *t-l:t|y 1:T ~~ ^X t \y 1:T Jt-l (7.88) 

Substitution of (D.5.39) and (D.5.34) then yields 

2 x t -i :t |y 1:T = ( Z x t |y 1:t + A(X t+1 |y 1:T ~ 2x t+1 \y 1]t )Jt)Jt-l ( 7 - 89 ) 
= ( S ^tlyi:t + Jt(Z Xt+1 \y 1:T ~ A ^x t \y 1:t j) Jt-l 
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+ (?x t \y 1:t AT (■AZx t \y 1:t A T + %x) ) (^x t+1 \y 1:T ~ AS x t |y 1:t ) {^x t - 1 \y 1 .. t - 1 AT {. A '^x t - 1 \y 1 .. t - 1 AT + S * 
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8 Unified Inference for LGSSMs 



In contrast to the exact VML-EM algorithm, on a given iteration of the VB-EM algorithm it is not the 
parameters 9 which are fixed to specific values, but the variational distribution q(8) over parameters. Further, the 
general update rule for the VB-EM E-Step corresponds to (cf. equation (17) of the main tutorial) 

<7 a+1) (*i:r) K exp ((In p(x 1:T ,y 1:T , 0)^(0^) (8.1) 

The inference procedures considered in Supplement Section 7, which result in the specification of p(x t |y 1:T ) and 
p(.x t _ lt \y 1:T ), however, are applicable in the case of known parameter values. Intuitively, one might think that, for 
the E-Step of the VB-EM algorithm, the parameter expectations under the variational distributions, \.e.,(6) q (i)^ g y 
may be forwarded to the KRTS smoothing algorithm. However, this is not appropriate, as one can show that for the 
LGSSM the expectation of the energy function under the current variational distribution over parameters q^iB) 
does not equal the energy function with averaged parameters. In other words, in general, 

(In p(.x 1:T ,y 1:T \G)) qV ) ig) * \^v(^vj,yv.T\{Q) q ^(e^) (8.2) 

Barber and Chiappa (D. Barber & Chiappa, 2007) have shown how, nevertheless, standard inference algorithms may 
be applied for the E-Step of the VB-EM algorithm. In the following, the inequality above is discussed in further and 
the solution provided by (D. Barber & Chiappa, 2007) is introduced. 

The inequality (8.2) is summarized in (D. Barber & Chiappa, 2007) as the "mean and fluctuation 
decomposition theorem." Specifically, it can be shown that the expectation of the conditional energy function under 
q^'{&) can be written as the sum of two terms: (1) the energy function conditioned on the parameter expectations 
(0) q (O(-0) and (2) a fluctuation term. Specifically, using the LGSSM inherent factorization of p(x 1:T ,y 1:T |0), one 

obtains 

(In p(x 1 , T ,y 1 .. T \9)) q (i) (e) (8.3) 
= (In (p(x 1 |0)n[= 2 P(^l^-i,0)n[=iP(ytl^,0))Vo (0) 

= (ln(p(Xi Xi))) q (Q ifllXl) + Z[=2<ln(pOt \x t -i,A, ^x))) q ^ {AXx) + SLi 0" (p(yt \x t , B, £ y ))>q(0 (B , Zy) 

Writing out the Gaussian forms of the probability distributions involved and summarizing terms independent of the 
latent variables x 2:T in a constant C £ R yields 

(mp(*l:T,yi:r|0)>q(O(0) = ((x x - Ml) 7 ^ ~ Ml)>q(0( Ml , Zl ) (8-4) 

-\YJt=2{{y t ~ Bx t yz-\y t - Bx t )) q(i){BXy) + C 

o Derivation of equation (8.4) 



We have 

{\r\p{x 1 , T ,y 1 , T \9')) q{e) 



(8.5) 



= On P(*1:T, Vl:T \lh, A, B, S x , I x , Xy^q^ABX^Xy) 
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= (In p(x 1:T , y 1:T \Hl, A, B, S x , Z x , Zy^q^XJqiAXxMBXy) 

= (ln(p(x X |M X , li) rit=2 P( X t I A, X t - lt I x ) Ilt=l P(y t |S, X t , Zytyq^XMAXxMBXy) 

= (ln(JV(x i; Mi, Si) rit=2 N(x t ; ^t-i- Ilt=i N(y t ; Bx t , X y ))) qitll z l)q ( A ,i: x ) q (B,x y ) 

(-- -i 1 \ 

(2tt) 2|S x | 2exp(--(x! -Mi^S^Oi - Mi))J W^M^M^) 

+ (In (n[=i(27r)-2|l y |-2 exp ^_i (yt _ B^)^" 1 ^ - BxS))) qilllXl)qiAXx)q{BXy) 
= (In ((2tz:)~2 |S x |~2 exp (- i (x x - Mi) 7 ^ H*i ~ Mi)))) 'gCni^i) 

+ (ln (nL2(27r)^|S x |4exp (~0t -i4x t _ 1 ) T Ej 1 (^t ~ ^ x t-i))) W,) 

+ (ln(n[=i(27r)-2|l y |-^xp(-i(y t -Sx^I-Hyt -Sx t )))) q(B , 2y) 
= (-^lnZjr-ilnllil -i(x x -/i 1 ) T Ii 1 (x 1 -/ij^^^^ 

+ ( _MZ^)i n27r -^ln|I % | -\Yj t=2 {x t -Ax^Yl-^Xt -M-i)W,^) 

+ <- f In 2tt - ^ln|S y | - ilLi(y t - Bx^l- 1 ^ - Bx t )) q{BXy) 
= -j\r\2n - (^lnlSiDq^j - ((x x - Mi) r 2r 1 (x 1 - Mi))^,^) 
^ln27r-(— ln|2 x |)q (AZJ --2[= 2 ((^ -^t-i) T ^ 1 (^t -^t-i))q(Ai: x ) 

-Y ln2 7T-(^ln|l y |)q( B , Zy ) -^lLl<(yt -SXtflyHyt - B^t))q(B,i: y ) 

Summarizing all terms independent of x x , x 2 , ... , x T as a constant C £ IR, one obtains 

(lnp(x 1:T ,y 1:T |6i)>q (0) = -((x x - j u 1 ) T ir 1 (^i -i"!))^^!) (8-6) 

- ilLi<(y t - ^ x t) T ^y 1 (yt - Bx t )) q(BXy) + c 

□ 

The right-hand side of (8.4) may be decomposed into the contribution of the exponent of an LGSSM with averaged 
parameters (A) q a) iA ^ x y (B) q a)^ B ^ y y (^x) q (0^ x ) and (2y}q(i)( 2 ) and "fluctuation" terms (see below for details) 

(lnp(x 1:T ,y 1:T |6i)> q (o (0) = \np(x 1:T ,y 1:T \(6) qi o Q0) ) + F AXx + F BXy (8.7) 

where 
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T 

T 



(8.8) 



and 




(8.9) 



and 



(8.10) 



Given that the additional fluctuation terms F A ^ x and F B 2 do not vanish in but the trivial cases, standard LGSSM 



in the VB-EM E-Step. In order to nevertheless use standard inference algorithms also in the Bayesian framework (and 
thus capitalize on the vast literature that has dealt with inference problems for LGSSMs in the past), the idea is to 
augment the observed variables and parameters of the "averaged LGSSM" (lnp(x 1 . T ,y 1:T \0)) appropriately. 
Specifically, by applying standard inference algorithms to an augmented LGSSM log joint distribution 
lnpg(x 2:T ,y 2 :T), where y 2 - T denotes an augmented set of the observed variables y t and 6 and augmented 
parameters set in terms of the visible variables and parameters of (lnp(x 1:T ,y 1 . T \0)) the latent variable 

distributions of the average LGSSM may be inferred. In other words, it can be shown that, for the appropriate 
specification of x 1:T ,y 1:T and 6 in terms of y 1:T and 6, 



The required variable and parameter augmentations are specified in in the following theorem and the above 
equation is verified below. 

Unified Inference Theorem 

To infer the distribution lnp(x 1 . T \y 1 . T ,0') using standard inference algorithms for the expectation of the LGSSM 
Znp(x 1:T ,y 1:T |0) under the variational parameter distribution denoted by 



inference algorithms supplied with the expected LGSSM parameters (flL&Vm are thus not appropriate for inference 



In Pe (x 1:T ,y 1:T ) = Qnp(x 1:T ,y 1 . T \0)) q w 



(8.11) 



{lnp(,x 1:T ,y 1:T \e)) q&w 



set 



y t ~ fc and 8 == {fl 1 ,I 1 A, I x ,B,E y } 




where 



-1 
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with Cholesky decompositions U A and U B given by 

Ul U A : = ( AT Zx lA )q(A,Z x ) - (A T )q(A\Z x )(Zx 1 )q(_Z x )( A )q(A\Z x ) 

and 

U T B U B := {B T E^B) q{BiSy) - (B T ) q{B \ Sy) (S^) q{Sy) {B) q{B ^ y) 

and infer 

lnp(x 1:T \y 1:T , 6) = (lnp(x 1:T \y 1:T ,6)) qW 

□ 

o Verification of equation (8.11) 

We verify that 

lnpgO 1:r ,y 1:7 = (^P(x 1 ,T,y 1 .. T \9)) q{e) (8.12) 

for the augmented observed variable vector 

y t == (o^j (8.13) 

and the following augmented parameters 

Mi : = Ml, 2i == 2 X , i == Wqciz,),^ == ((^>q ( z ;c ))" 1 (8.14) 

and 



'<*><(*|i y )\ / (^>^ y )) _1 



J kxp 
)fcxp 



'pxfc 


Opxfc 1 






Ofexfe 


| (8.15) 


'fcxfc 


/fc / 





where 

UjU A == (^S-M),^) - (A\ {Al z x) (Z;% {Zx) (A) q(AlZx) (8.16) 

and 

£/Jt/ B := {B T ^B) q{BXy) - {B^f^VyX^ (B) q(BlTy) (8.17) 

by substitution. On the one hand, this verification implies that indeed the inequality (8.2) holds, while on the other 
other hand it simultaneously justifies the application of KRTS smoothing to the augmented LGSSM. We have 

lnpg(x 1:T ,y 1:T ) = ln(N(x 1 ;M 1 ,S0n[=2^(^t;^t-i,2jnLi^(yt;Bx t ,S y )) (8.18) 
= - \ In 2n - \ In \t x \ - \ (x x - Mx) 7 ^ 1 Oi " Mi) 

_^ ln27r _Z^ ln |s x | -\jS^((x t -Axt-J&ixt -Ax t _S) 
_ (p^ocr) ln ^ _ t ln| ^ | _ i ^ _ g Xt) ^-i (yt _ fot) ) 
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Summarizing all terms independent of x 1;T into a constant, we obtain 



lnpg(x 1:T ,y 1:T ) = --Oi 1 (x 1 -fa) 

~ 2 Ht=2( X t — Ax t-l) 2-X 1 ( X t ~ Ax t-l) 

~ \YJt=i{yt - Bx t ) T t-\y t - Bx t ) + C 
Substituting of the augmented observed variable and parameters then yields 

lnpg(*i:r<yi:'r) = ~\i x \ -^iY^i 1 ^! ~ Mi) 

~\^ T t=2 (( X t - (A)^^^) 7 (I x )-{ Tx) (x t - ( A ) q{A \Z x ) x t-l)) 



fcxp J fc u kxk 

fcxp o fcxfc 4 



= -\(x 1 - /ii) T 2i x Oi - 




^\ r /(^)^ y )) _1 pxk pxk \/<B> 



q(B|Z y ) 



" 2 ° fc °*x P /* kxk U A \x t 



Ofexp Ofcxfc h / ^ B 



'<B> q (B\i: y )\ /(Gy)^)) pxk pxk \ f(B)q{B\z y ) 
U * I kxp / k kxk ^ 

u b / \ o kxp o kxk / k / \ y B 

= ~Oi -Mi^lVi - Mi) 

_ 2^=2( x t _ ( A )q(A\T x ) x t-l) (Z x )qll. x )( x t ~ ( A ) q(A\X x ) x t-l) 

- i!Liy t (<£ y >^ y) ) yt ~ 2% (<£ y >^ y) ) <«> fl ( B p y )^ 

+^(5>; (B | 2y ) (^y>q(z y )) 1 < s > Q («|2 y )^ + *t T tfJ^* t + 4f/Jf/ B x t 



= --(%- /ii) T Ii 1 (X 1 ~ fit) 



~\TJt=2{ x t ~ ( A )q(A\X x ) x t-l) (Zx)qll. x )( x t ~ ( A ) q(A\T x ) x t-l) 
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- iZLi (yt ~ (B) q ( B] x y )X t ) T < 2 y)q(s y ) (yt - ( B > £? ( B |z y )^) 

-lT. T t=1 xI«B T ^B) q{BXy) - {B% {B ^ y) {^\^ y) {B) <B ^ y) )x t 
which yields 

lnpgO 1:T ,y 1:7 = lnp(aci :T< yi :T |<0> fl (e)) (8.21) 

+ Zt=l x ?((^ T2 x 1 ^>q(A2 x ) ~~ (^ T )(?U|2 X ) ( 2 x 1 )q(Z x )(^)qU|Z x )) x t 
+ ZF=l4((S T S- 1 B) £?(B , 2y) - (B T ) q{my) m\^ y) {B)q{B\L y )) X t 

= Qnv{xwym\o))q{0) 



Hence, in the augmented form, one has a standard LGSMM for which the Kalman-Rauch-Tung-Striebel algorithm can 
be used for inference. On the other hand, the augmented form is equivalent to the expectation of the LGSSM under 
the parameter distribution q(8) as required by the VB-EM algorithm E-Step. 
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9 Mathematical details of the tutorial example 

9.1 VB for the univariate LGSSM 

We assume the following generative model 

P{XO:T,yi:T,V-0,Oo,a,0%,b,0%) := (9.1) 

pOo)pOo )p(a)p(.<r%)p(.b)p(<T$)p(.x \n , (T 2 ) n[=i p(*t l*t-i< a, o%)p(y t \x t , b, er 2 ) 
For convenience, we work with precision instead of variance parameters, i.e., we set 

(Tq : = K 1 . °x '■= K X , and ct 2 := Xy 1 (9.2) 
We further assume that the initial state x is known as p(x = 1) = 1, which, given 

p(x l/io, cr 2 ) = NOd/io, er 2 ) (9.3) 
may be motivated by setting n p(/i = 1) = 1, p(er 2 -» 0) = 1. The generative model in xxx thus simplifies to 

p{x VT ,y V T,a,X x ,b,Xy) = p{a)p{X x )p{b)p{X y )X\l =1 p{x t \x t _ 1 ,a,X x )p{y t \x t ,b,X y ) (9.4) 
The probability density functions on the right hand side of (9.4) are assumed to be given as 

p(a) := N(a; \i a , er 2 ) = (2n er 2 )~2 exp (- ^ (a - //J 2 ) (9.5) 
p(X x ) := G(A x ,a lx ,b Ax ) = ^^/^ex P (-^) (9.6) 

p(b) :=N{a;n b ,ot) = (2tt exp ("^0 - v b ) 2 ) (9.7) 

p(X y ) == G [x y ; % , b Xy ) = ^^r y <^ «p (- (9.8) 

i 

pCxtlx^^a,^) := NO^ax^,^ 1 ) = (^) 2 exp (-^(x t - ax^) 2 ) (9.9) 

i 

p(y t |x t ,M y ) == N{x t \bx t ,X y 1 ) = (£fexp(-±(y t -bx t ) 2 ) (9.10) 

Next, we assume the following factorized form of the variational approximation to the posterior distribution over the 
unobserved variables 

p{x VT ,a,X x ,b,X y \y VT ) « q{x 1 , T )q{a)q{X x )q{b)q{X y ) (9.11) 

were we set 

<7(*i:r) : = rit=2q(^-i^t) = Ut=2P(.x t - lt x t \y 1:T )Ut=2P ( x t-i<^;^ t _ 1:t | yi:T ^x t _ 1:t | yi:T ) (9-12) 
and for VB-EM iterations i = 0,1,2, ... 

q© (a) := jv(a;ri .^ (0 ) = (27ra 2 «) 4 exp (- ^ (a - M «) 2 ) (9.13) 
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q^X x ) == G ft®) = * * V^expf-^) (9.14) 

V ^/ x x \ x x / 

x x 

q(b) = N(a;$,af>) = (2ira 6 2 ®)" 1 exp (- (ft - /,«) 2 ) (9.15) 

,(A y ) := C (l y ; a®, ft®) = - ex P (- & ) (9 " 16) 

Ay 

Note that parameters of the prior and variational distributions are distinguished by means of the iteration 
superscript (i). 

According to the variational Bayes theorem for factorized posteriors, the variational free energy is maximized for 

q&t) « exp(Jq(t9 v )hip(y,t9)dt9 v ) (9.17) 

For the current example, we have 

y = y VT andtf := {x 1:T ,a,A x ,b,A y } (9.18) 

where the unobserved variables partition according to 

q&) = Uq&i) ■•= q(x 1:T )q(a)q(A x )q(b)q(A y ) (9.19) 

We thus obtain the following update equations, using the short hand notation (f(x)) p ^ •■= / p(x)/(x)dx for 
expectations 

<7 a+1) Oi:r) « exp ((lnp(y,x 1:T ,a,A x ,ft,A y )> (?(0(a)(? (o (A;c)(? (o W£? (0 (Ay) ) (9-20) 

q( i+1 Xa) oc exp (<ln p(y, x 1:T , a, X x , ft, ^qa^^qCO^co^co^) (9.21) 

9 (t+1) (Ax) K exp (<lnp(y,x 1:r ,a,A x ,ft,A y )) q(i)(Zir)£?(i)(a)£? (o W£? (o (Ay) ) (9-22) 

a exp ((lnp(y,x 1:T ,a,A x ,ft,A y )) £?m(XiT)£?(0(a)£? ( 0(A;c)q (o (Ay) ) (9.23) 

<? (t+1) (Ay) « exp ((lnp(y,x 1:T ,a,A x ,ft,A y )) £?(0(XiT)(?( o (a)£? (o (A;c)£? (O w ) (9-24) 

9.2 Evaluation of 

Applying the parameter augmentations discussed in Supplement Section 8 to the univariate LGSSM, one obtains 

% = Co) (9-25) 



a = (a)q{a) = Ma 
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_ f(b) q ( b )\ 
b = \ u A = 
V u B J 



[ 



^(a 2 Wq(a)q(A x ) ~ ( a ) 2 q(a)( A x)q(A x 



\ 



/i — \ 

\Jo*6 + °l)a Xy b Xy - \i\a Xy b Xy J 








1 

1/ 








1 
V 



9.3 Evaluation of <f (a)qr(A JC )<jf(ib)qr(A J ,) 

To derive the update equations for the LGSSM, we take the following approach: We first reformulate the expectation 
of the joint distribution of the LGSSM and parameters, p(y, x 1:T , a,A x , b,Ay), under the variational mean field 
approximation q^ i+1 \x 1:T )q^\a)q^\A x )q^(b)q^(Ay). We can then in turn derive for each variational 
distribution by ignoring its contribution to the expectation based on the variational Bayes theorem. 

o Step 1 - Evaluation of the variational expectation ofp(y, x 1:T , a, A x , b, Ay) 

(\n P (y, X 1:T , a, A x , b, ^y)>qa + i) (%1:T) q(0 (a) q(0 (A;c) q(0 (&) q(0 (Ay) (9.26) 
= (\np(y, X 1:T \a, A x , b, A y )p(a, A x , b, ^y)>qa + D (Zl:T )q(0 (a) q(0 (A;c) q(0 (ft) q(0 (Ay) 

Under the assumption of a factorized prior distribution 

p{a, A x , b, A y ) = p{a)p{A x )p{b)p{A y ) (9.27) 

and with the likelihood 

p{y,x 1 , T \a,A x ,b,A y ) = \\ T t=2 p(x t |x t _!, a,A x ) U T t=iP(yt\x t ,b,A y ) (9.28) 

we obtain 

(In P(y, X 1:T , a, A x , b, ^y)>qa + i) (Xl:T )q(0 (a) q(0 (A;c) q(0 (&) q(0 (Ay) (9-29) 
= lL2(lnp(x t |x t _ 1 ,a,AJ>q(i + D fe _ 1:t) q(0 (a) q(0 (A;e) 
+ St=l(lnp(y t \x t , b, ^ y )>qa + i) (Xt) q(0 (&) q(0 (Ay) 

+(lnp(a)) £?(0(a) 

+(\np(A x )) q(i)(?Lx) 

+(\np(b)) q(i)(b) 

+(lnp(A y ))q( i)(Ay) 

Next, substitution the respective forms for the probability density functions, we obtain 

(In p(y, x 1:T , a, A x , b, A y)) q ^ {Xl .. T ) q ®{a) q ^{7i x ) q ^ (b)q® ( 9 - 3 °) 
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= U=2 dn (g) 2 exp (- ± (xt - «t-i) 2 ) j>«i^&c Mit )«i»(a)«i»CW 
+ SLi On (S) 2 exp (- ^ (y t - ^ t ) 2 )^ a+1 ) (Xt)£?(0( , )£? (o (Ay) 

+ < ln (fe) 2 ex P (- £f ( a - ^°) 2 ))>«®(a) 



+ <ln (^^ A/ " _leXP ("l)j V °^ 

+ < ln (fe)' ex p(-^( ft -^ ) 2 ))v> w 



+<ln ^^ A/i?_lexp l-tiJ >q(i) ^ 

Evaluation of the logarithmic expressions then yields 

(In p(y 1:T , x 1:T , a, X x , b, ^y)> (J «+i)( Xl:I .) (f «)(o) (I co (A;e)(I co (l , )(I co ( ^ ) (9-31) 
-(^|(a-^ l) ) 2 Vo (a) -iln27ra a 2 

-<4(*-rfO\ c °c»4 ln2,ro > 

- ln ( r «)) - ln + - < ln ^ W y ) - jnWw 

Ay 

Evaluation of the quadratic terms then yields 

<lnp(y, x 1:T , a, A x , b, A y )> <l «+0( Xl:I .) (f «)(B) (I co ( ^ )(I co (l , )(I co ( ^ ) (9-32) 
= ^ ln (if)V°(A,) -I< A *St=2^t - 2al[ =2 x t _ 1 x t + a 2 Z[=2^-i^-i)> q a + i) (%t _ i:t)£? (o (a) q(0 (A;c) 

__L (a 2 _ 2a/ ,« + (M®)\(0 (a) - jln27ra 2 
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- m (r « } )) - ln ( b Sf X + « } - < ln ^ W y) -pQy)«K> 

Xy 

o Step 2 -Application of the Variational Bayes Inference Theorem 
Based on the above, we now apply 

p&t) oc exp(Jg(0 v )lnp(y,0)d0 v ) (9.33) 

for each variational parameter distribution q(a),q(A x ),q(b), and <7(/l y ). Likewise, we ignore all terms in (9.33) 
which are independent of the variable under consideration, because these terms influence the respective variational 
probabilities independent of the variable under consideration. 

Update equation for q(a) 

We have 

lnq(a) a -\Q7 t=2 X x x t x t - 2aJJ =2 A x x t _ 1 x t + a 2 S=2^A-i^-i>,(i+i) fe . 1:t ) g (i)(y ~ ^|( a2 _ 2a ^° + (^f) 

(9.34) 

Partitioning the expectations, we obtain 

In q(a) OC — - Et=2Ux^t x t)q(x t ) q (^) _ 2- a YJt=2(^x x t-l x t)q(_x t _ lt )q(_A x ) + ^ Zt=2Ux x t-l x t-l) q (x t _ 1 ) q (^) _ ~~ + (Ma"*) J 

(9.35) 

Rewriting the above as a quadratic form in a yields and ignoring terms independent of a yields 

lnq(a) « - \ (lt=2<^t-i^t-i)q(z t _ 1 )q(A x ) + ^2) a 2 - (St=2<^t-i*t>«i&c t _ 1:t )(i(A x ) + ^jr) a ( 9 - 36 ) 
According to the completing-the-square theorem (Supplement Section 2), 

exp [-\aa 2 - /?a) a N(a; a -1 /?, (9-37) 

we thus have 

q^(a)KN(a;d +1 \of +1) ) (9.38) 

where 

<*t } : = (l!t=2Ux^-l x t-l>qa+i)( Xt _ 1 )q(A ;c ) + ~^o) ( 939 ) 

and 
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: = of +1) (z^-AW^y + ^r) 0-40) 

Update equation for q(A Y ) 

From (9.33), we have, using the expectation partitioning as above and reordering in terms of \nA x and A x in we 
have 

lnq( i+1 )(l z ) oc (^ + a£> - l)lnl z - ^(E[= 2 ^t^t> q (x t ) - 2 Et=2<a*t-i*tW- l!t )<j(a) + It=2<a 2 *t-i*t-i>(j(a)(jC*t-i)) + J^)*x 

(9.41) 

Taking the exponential on both sides then yields 

(9.42) 

Up to a normalization constant, <? ( -' +1 ' ) (^x) is thus given by a Gamma distribution 

q« +1 \A x ) a A a t 1} exp (-^-) (9.43) 



with 



and 



a a+D = Izi +a (0 (9.44) 

^x 2 ^x 17 



^ = \J® + \ (£t=2<*t*t><i(x t ) ~ 2 Et=2<a*t-l*t><i(x t _ l!t )<i(a) + St=2<a 2 ^t-l^t-l)q(a)q(x t _ 1 )) J < 9 - 45 ) 

Update equation for qffo) 
We have 

lnq(ft) a -\A y ^J t=x y t y t - 2b Yj t=1 y t x t + b 2 U=iX t x t ) q(i+1){Xt _ i t)q(i) ^ - ^ [b 2 - 2b[if + (^°) ) 

(9.46) 

Partitioning the expectations, we obtain 

\nq(b) a -i A y I[ =1 y t y t - 2M y I[ =1 y t <x t )q (%t) + 6% £t=2<*t*t>qO t ) - i^( ft2 ~ 2 ^£° + ) 

(9.47) 

Rewriting the above as a quadratic form in b yields and ignoring terms independent of b yields 

\nq(b) a - \ (a v Jj^^ + ^) ft 2 - (l y ZLi y t <*tWt) + ^) & 0-48) 
According to the completing-the-square theorem, 
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exp {~\ab 2 - fib) <x N(b; a -1 /?, a" 1 ) (9.49) 

we thus have 

q^\b)KN(b;£ +1 \of +1) ) (9.50) 



where 
and 

,(0 

Update equation for q(A y ) 



^ a+1) == (A y ZL 2 <*t*t W t ) + 1 (9.51) 



M a + D := ^ 2 a + D ^ sLiyt<Xt)q(xt) + <|) (9 .52) 



From (9.33), we have, using the expectation partitioning as above and reordering in terms of \nA x and A x in we 
have 

\nq( i+1 \A y ) oc (1=1 + af y - l)lnA y - ^(2F=iy t y c - 2bX y Y7 t=1 y t {x t ) q{xt) + fe 2 A y 2X 2 <*t*tW t) ) + j^j^y 

(9.53) 

Taking the exponential on both sides then yields 

9 a+1) (Ay) K A y 2 Ay exp ^ + \(Xt =1 y t yt - 2M y £[=i y t {x t ) q{Xt) ) + b 2 A y YJ t=2 (x t x t ) q(Xt) J A 3 



(9.54) 

Up to a normalization constant, q^ l+1 \A x ) is thus given by a Gamma distribution 

q< i+1 >&)«A<°exp(-^ (9.55) 



with 



and 



4r ) = ^ + ^ (9-56) 



= {^ + \& T t^ytyt-ZbA y Yj t=1 y t {x t ) q ^ (9.57) 

9.4 Evaluation of T(q(d)) 

As in Supplement Section 6, we have 

^(q(^)) = Jq09) lnp(y|t9) dtf + JffoW) - JO;(g(0)||p(i?)) (9.58) 
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Evaluation of the average energy term/ q(-Q) lnp(y | i9) dd 
In the current context, the average energy term is given as 

f q(ti)\np(y\ti) cW = (lnp(y 1:T |x 1:T ,a,A x ,ft,A y )) £?(;Ci:T)(?(a)£?(A;c)£?(&)£?(Ay) (9.59) 
= (ln(rit=i P(y t \x t , b, A y ) Y\I=2 P(x t |x t _ 1( a, A x ))) q ( Xl:T MaM* x M» q (h) 

= (Zt=llnp(yt|^t<^<^y)>q(x 1:7 .)q(ft)q(A y ) + &t=l ln P( x t l*t-l< a < ^x))q(x 1:T )q(a)q(A x ) 

= SLi (In (g£) 2 exp (- ^ (y t - bx,) 2 )^ q{XvT)q{m{Xy) 
+ St=i (In (g) 2 exp (- ^ (x t - ax t _{) 2 ^) q{XvT)q{a)qax) 

= Zt=i(^lnA x - iln27r - ^x t 2 - ax t x t _ x - a 2 x?) q(Xi:T)q(a)qax) 
+ It=i <^ln A y - j In 2tt - ^y t 2 - y t bx t - b 2 xf) q(Xi TMb)q(?Ly) 

= \{\nA x ) q{?Lx) - ^ =1 (A X X 2 ) q {x t ~)q{X x ) 

-Zt=i(ax t x t _ 1 )q (a) q (Xt i:t) - U =1 (a 2 x 2 ) q{a)q{Xt) - 
+ ^(\nA y ) q(Ay) - -JJ t=x y 2 t a y )q (Ay ) 

- Zt=iyt(fr*t}q(b)qO t ) _ Zt=l(^ 2x t >q(ft)q(x t ) ~ ~^2n 

Considering the remaining integral terms in turn then yields 



Qn*x)q(?L x ) = Jq{A x )\nA x dx = (ifj(a A J + ln6 A J (9.60) 



b) 



C***t ><i(A»)<i(x t ) = If q{A x )q(x t )A x x' 2 dx t dA x (9.61) 
= U U q{A x )q{x t )A x x 2 dx t ) dA x ) 
= (I (9(^)4 I q(*t)*t <^^t) d^) 

= (l (qUx)Ax(Mx t | yi:T + CT l t |y 1: r)) dA *) 
= Q*x t \y 1 . T + a x t \y 1:T ) f q&x)Ax dA x 
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= /Cf qicOqixt-^axtXt^ da) dx t _ vt 
= Jiqixt^XtXt^ J q(a)a da) dx t _ vx 

= Mo(Iqfe-l:t)^^t-l dx t _ 1:t ) 
= i"a(J(J <70t-l:t)*t*t-l <**t-l) ^X t ) 

= Va(f(q(x t )x t J" qOt-tK-i dx t _i) dx t ) 
= Ma/^tlynrA^t-ilynr 

(a 2 *t W)<?(* t ) = If q(aMx t )a 2 x 2 dadx t 

= J (J q(a)q(x t )a 2 x 2 da)dx t 
= J(q(x t )x 2 J q(a)a 2 da)dx t 
= il4 + ^a) J(q(x t )x t 2 )dx t 
= 04 + ^a)(<|y 1:T + <|y 1:r ) 

(In A y )g (Ay) = (i/> (a Ay ) + In fc Ay ) 

(bx t ) q{Xt)q{b) = JJ q(x t )q(b)bx t dbdx t 

= JCf qi.x t )q{b)bx t db)dx t 

= S(q(.x t )x t J q(b)b db)dx t 

= V-b ${q{x t )x t )dx t 
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h) 

{b 2 Xt)q(x t )q(b) = If ' qi.x t )q{b)b 2 x 2 dbdx t (9.67) 
= f(fq(x t )q(b)b 2 x 2 db)dx t 
= f(q(x t )x 2 fq(b)b 2 db)dx t 

Substitution of results a)-h) into (9.58) then yields an expression of the average energy in terms of the parameters 
governing the unobserved variables and the data as follows: 

(\np(y 1:T \x 1:T> a, A x , b, ^ y )) q{Xl . TMaMAxMb)q ^ = \ (OK) + 1" hj) (9-68) 

-2^t=l( a Pi x h x 04 t]yi . T + CT £|y 1:T )) 

_ Zt=l(A t aA J x t |y 1:7 .A J x t _ 1 |y 1:T ) 

-It=l((^ + O-a)04|y 1:r + <|y 1:r )) 
T 

— In 2ti 

2 



+ 



K^K) + ln^) 



_ Zt=iytA t ftA t x t |y 1:7 . 



-ln27r 

2 



= I (OK) + In J + ^ (%) + ^ b h ) 



a A x b A x v r 2 I 2 

2 2it=lftc t | yi:I . + a x t \y 1: - 



' Et=l(MaMx t |y 1 . 7 -Mx t _ 1 |y 1 . 7 -) 
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+ °l) l T t=lf4 t \y 1:T + <|y 1;r 

2 Zit=i7t 
Zt=iytMx t |y 1:7 . 
-(^ + ^)It=l04 t |y 1:r + <|y 1:T ) 

-Tin 2tt 

Evaluation of the entropy term K(q(-6)) 

Due to the factorization properties of the variational approximation, the entropy term K (q($)) decomposes into a 
sum of entropies as follows: 

K (<?09)) = K (q(.x 1:T Ma)qa x Mb)q(A y )) (9.69) 

= K (q(x 1:T )) + K(q(a)) + K(A X ) + K(q(b)) + Jf (q(A y )) 
Evaluating the terms in (9.69) in turn then yields 

H(q(x 1:T )) = KtJlLz <?Ot-i, x t )) (9.70) 

= Zt=2^(^( x t-l:t;i"x t _ 1:t |y 1:7 .<^x t _ 1:t |y 1:7 .) ) 

The entropy of the unobserved variables x 1:T is thus given as the sum of the differential entropies of adjacent 
variables x t _ x and x t , i.e. as the sum of the differential entropies of T — 1 bivariate Gaussian distributions. We thus 
have 

X(q(x 1:T )) = (T- 1)(1 + ln27r) + i2F=2ln|^ t _ 1:t |y 1:T | (9.71) 

Further, we have 

7£{q{dj) = X(N{a;n a ,o*)) = \\no 2 a + \n(2n)) (9.72) 

K(q(A x )) = H(G(A X ; a Xx , b x J = In Y{a Xx ) - {a Xx - l)^(a A J - In b Xx + a Xx (9.73) 
K(q(b)) = H(N(b;n b ,aZ)) = + \{\ + ln(») (9.74) 

■H (q{Xy)) = H (G(A y ; a Xy , b Xy ) = In Y{a Xy ) - (a Xy - l) ip (a Xy ) - In b Xy + a Xy (9.75) 
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Evaluation of the KL divergence term X£(q(-6) \ | pQ9)) 



We first note that due to the factorization properties of <7(i9), the KL divergence term can be rewritten as the sum of 
KL divergence terms over unobserved variable partitions as follows 

_3C£(q&)\\P&)) = ^(q(a: 1:r )q(a)q(A 3e )q(6)q(A y )||p(a: 1:r )p(a)p(A 3e )p(6)p(A y )) (9.76) 

= Xl(qix 1:T )\\p(x 1:T )) 
+KL(q(a)\\p(a-))+XL(q(A x )\\p(A x )) 
+KL{q{b)\\ V {b)) + KL{q{l y )\\ V {A y )) 
Using the results in (Penny, 2001), we can then evaluate the respective KL divergence terms in turn as follows 

KL{q{x 1 , T )\\p{x 1 , T )) = SCCCII^ <?(*t-i.*t) II Ilt=2P(*t-i>*t)) ( 9 - 77 ) 

= EL2^c(<?(*t-i'*t)llp(*t-i»*t)) 

= SF=2 XL (N (x t _ 1:t ; ^l^y^, 4?_ 1:t |y 1: r) I \ N { x t-i*. ^ 1:t |y 1:T ' Z S 1:t |y 1: r)) 

Further, we have 

XLCqCa)\\p(a)) = Kl[n (a; ^ ,af } ) ||N (a;^ 0) ,a a 2(0) )) (9.78) 

KL(q(A x )\\p(A x )) = Xl(g (a x ; a®, ft®) | |G (a,;^,^)) (9.79) 

JCCfotfO | |p(6)) = X£ (JV (a; p®, <x 6 2(0 ) | \N (a; pf, a| (0) )) (9.80) 

KL(q(A y )\\pCA y )) = Kl(g (A y ;a®,fc®) ||C (A y ; a®, 6 y 0) )) (9.81) 

9.5 Probability density function transformations 

To estimate a posterior distribution over the latent SDE parameters a £ R and a £ R +l we used the Euler- 
Maruyama approximation, which entails the use of two transformation mappings from a and a onto the LGSSM 
parameter a £ R and cr 2 £ E+, as discussed in Supplement Section 3. If we obtain posterior pdfs over a and o x 
based on VB inference for the LGSSM and would like to use them for posterior inference over the SDE parameters a 
and a, we have to respect the transformation theorem for pdfs, as discussed below. 

General transformation theorem 

In general, we have for a probability measure P on the Borel a -field ® with pdf / of x and a measurable linear-affine 
transformation given by 

T: R -> R,y := T(x) (x £ R n ) (9.82) 

the following pdf g for y 
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yyyj |T'(T _1 (y))| 
Transformation theorem for pdf over a to pdf over a 

For the special univariate linear affine transformation case, we have with a, b £ R, a 

T: R -> R,y := T(x) := ax + b (x G R) 

the following pdf g for y : 

Above, we have seen that the inverse mapping of the transformation from a to a is given by 

T t :R 

Substitution thus yields for the posterior pdf g over a, given the posterior pdf / over a 

g(a) = At • /(At(a + At)) (a £ 1R) 
Transformation theorem for pdf over X v to pdf over a 
Above, we have seen that 



1 11 
iL, a i-> (a) = a = — (a — 1) = — a — — 
1V J At v y At At 



7 2 : R + -» ~ T 2 (A X ) = AtA x 



with derivative 



and inverse 



Substitution in (9.83) thus yields 



T 2 '(^) = -iAtV 



77 1 : E + -» R + ,a » 77V) := (^)' 



Jfjl /((f) 2 ) 2 -/((f) 2 ) M®') 



Transformation theorem for pds over A y to pdf over gg 
Obviously, we have 

r 3 :E + ^M + ,^^T 3 (4) = A- 1 

with derivative 
and inverse 
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Tf 1 : R + -» R + , ct 2 ~ ^(ay ) = (<7 2 )~ 1 (9.94) 
Substitution in (9.83) thus yields 

^ = \7T^h\ = (^H (9 " 95) 
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